Parallelizing TimSort

In my previous post, I talked about an issue I faced while implementing TimSort. That implementation was primarily a serial implementation with only minor use of parallelization while copying arrays here and there (due to inbuilt language features of Chapel). However, since Chapel is a ‘productive parallel programming language’, I decided to improve on the algorithm.

First, some background on TimSort. TimSort is a hybrid sorting algorithm that combines insertion and merge sort, while taking advantage of naturally occurring sorted sub arrays, known as runs, present in practical data. TimSort searches for such runs, and if they are lesser than a minimum length, binary insertion sort is used to increase the length of the run. The runs are then merged, with the help of a stack (to store runs) and two rules to decide which runs are to be merged; X>Y+Z and Y>Z (where X,Y,Z are the top 3 runs of the stack). If either of the two rules are violated, then Y is merged with the smaller of X or Z.


This is done to optimize merging, by selecting similar size runs. However these rules force the algorithm to be serial. So I threw them out the window, and parallelized the code by using a recursive call and perform merging of pairs of runs simultaneously.

proc _TimMergeRuns(Data:[?adom],runs:[?rdom] ?domtype, lo=rdom.low, hi=rdom.high, comparator:?rec=defaultComparator):domtype{
   return runs[lo];
 var dom1,dom2: domtype;
 var m = (lo+hi)/2;
 cobegin with (ref dom1, ref dom2){
   dom1=_TimMergeRuns(Data,runs,lo,m, comparator=comparator);
   dom2=_TimMergeRuns(Data,runs,m+1,hi, comparator=comparator);
 return _TimSortMerge(Data,dom1,dom2, comparator=comparator);

Here, runs is an array of  domains, which is chapels way of representing valid indices of an array.  Each domain in runs represents a sub array within the Data array that is already sorted. The algorithm recursively splits the array of runs by 2 until only 1 is left (which is used as the escape hatch), then merges the runs using _TimSortMerge(). cobegin is used for task parallelism, and runs each statement within it on a separate thread. Chapel uses “Task Intent” clauses, such as ref  to allow non-local variables to be modified in a parallel region.

The resulting performance after parallelism, with results of parallel merge and quick sorts to compare with:

Parallelized TimSort:
Time taken to sort 131072 bytes (16384 int(64)s)
quickSort (seconds): 0.1807
mergeSort (seconds): 0.92939
timSort (seconds): 0.560166

Original Serial TimSort:
Time taken to sort 131072 bytes (16384 int(64)s)
quickSort (seconds): 0.182228
mergeSort (seconds): 0.571098
timSort (seconds): 1.89295

Without the rules of TimSort for merging, it behaves very similar to merge sort, with the advantage of having larger runs to merge, and optimized merging of skewed runs (previous post, Galloping Phase) .

Code on Github.

#chapel, #parallel-computing, #sort, #timsort



Chapel is an open source programming language designed  for productive parallel computing on large scale systems. It’s development is being led by Cray Inc.

Chapel is designed to run on high-end super computers, but also, due to its portability, runs on mulitcore desktops and laptops, clusters, and the cloud. Chapel supports multithreaded execution using abstractions for data and task parallelism, concurrency and nested parallelism. Chapel also supports object-oriented programming, type inference and generic programing.

To me, coding in chapel feels like a cross between Python and C++. It does not require a main() function, like Python, allowing for rapid prototyping. However for production grade code it uses a more structured programming style by declaring modules and using a main() function as the entry point.

Chapel supports data-parallel execution using forall loops which behaves like OpenMP’s #pragma omp parallel for. Task-parallelism is achieved using coforall loops which create a new thread for every iteration. This is usefull for when each iteration does large amounts of work. It also uses begin and  cobegin to perform tasks in parallel, similar to OpenMP’s tasks and sections.

Chapel is designed to be easy to learn for users of most other languages like C, C++, Java Python, Matlab, etc. It builds on concepts and syntax from other programming languages.

#chapel, #parallel-computing

Static vs Dynamic Scheduling (OpenMP)

Solving problems from Project Euler is a great way to experiment with code. While solving problem 49 ( I decided to speed it up by parallelizing it. The problem input was too small to take any noticeable execution time, so I just increased the input size to get more number sequences. I will not be sharing the answer here.

The aim is to find an arithmetic sequence of 3 numbers  such that 1) each number is a prime, and 2) Each number is a permutation of the same set of digits. eg: 1487, 4817, 8147.

First step is finding all prime numbers using the Sieve of Eratosthenes method.
Then find arithmetic progressions within them and check if they are permutations of the same set of digits.

The second step can be parallelized by scanning through multiple chunks of prime numbers simultaneously. However there is no guarantee that each thread will get the same work load. Static scheduling may not provide a significant improvement as threads may join early if they complete their job. This is where dynamic scheduling shines. Threads can steal work form others in order to keep themselves busy and complete execution faster.

Testing the code on an AMD E1-2100 APU with 2 cores and searching within prime numbers from 1000 to 500000 (these patterns don’t exist bellow 1000).


The results clearly show the improvement of dynamic over static scheduling in this case.


#c, #openmp, #parallel-computing, #project-euler

K Length Subsets – Parallelized

Problem Statement: Given an array of integers of size N, generate all the subsets of size K.

Aim: To parallelize subset generation using OpenMP and compare with serial code.

Github code for both serial & parallel: K Length Subsets
Serial code is based on this java implementation of generating K length subsets.

Approach :

The main scope for parallelization is to ‘grow’ multiple parts of the final list simultaneously, as there is no dependency between elements.

Generation of  each subset is divided into individual tasks, shared across multiple threads. Each thread uses its own link list to store the subsets generated by it.
After all threads finish, their lists are combined to one.

#pragma omp parallel
	#pragma omp single nowait
#pragma omp taskwait

Here, #pragma omp single nowait is used as a master thread, that assign tasks to the other threads.

void subset(int A[], int s, int k, int start, int current, boolstr used, LinkList lists[])
	if (current == k)
		int *tmp = calloc(k, sizeof(int));
		int f = 0;
		long tu = used;
		for (int i = 0; i < s&& f < k; i++, tu = tu >> 1)
			if (tu & 1 == 1)
				tmp[f] = A[i];


		LL_add(&lists[omp_get_thread_num()], tmp);

	if (start == s)

#pragma omp task
		subset(A, s, k, start + 1, current + 1, used | (1 << start), lists);
#pragma omp task
		subset(A, s, k, start + 1, current, used, lists);

#pragma omp task creates tasks that are distributed across available threads.

To keep track of elements in a subset, an unsigned long long is used as a boolean string. Hence maximum number of elements that can be used is limited by the number of bits in the Boolean string.


Input: N=50, K=6
=> 15,890,700 subsets of length k

System Specs Serial Tests Parallel Tests
Clock Ticks Real User Sys Clock Ticks Real User Sys
AMD E1-2100 APU @ 1.00 GHz 1 15510996 15.910s 14.352s 1.322s 19783962 10.527s 18.307s 1.656s
2 Cores 2 Threads 2 15436168 15.721s 14.463s 1.140s 19548649 10.360s 18.325s 1.623s
Ubuntu 3 15327170 15.594s 14.277s 1.216s 19548649 10.837s 18.093s 1.652s
Averages:  15424778 15.742s 19627086  10.575
Intel(R) Core(TM) i7-4700HQ CPU @ 2.40 GHz 1 2990031 3.028s 2.877s 0.152s 8951671 1.290s 8.046s 0.950s
4 Cores 8 Threads 2 2999294 3.037s 2.826s 0.212s 8641548 1.196s 7.839s 0.845s
Ubuntu 3 3119834 3.160s 2.920s 0.240s 8935114 1.232s 7.997s 0.982s
Averages:  – 3036386 3.075s 8842777 1.239
Intel(R) Core(TM) i7-4700HQ CPU @ 2.40 GHz 1 3531250 3.551s 3.156s 0.391s 14843750 3.595s 5.328s 9.531s
4 Cores 8 Threads 2 3546875 3.571s 2.984s 0.578s 13906250 3.552s 5.516s 8.406s
Windows Subsystem for Linux (WSL) 3 3515625 3.575s 2.969s 0.563s 13796875 3.561s 5.094s 8.719s
 Averages:  –  3531250  3.566s  14182291  3.569

The tests show a real time improvement of almost 33% on 2 cores, 2 threads and 60% on 4 cores,8 threads due to parallelism, using Ubuntu Environment.

However, there is no improvement at all on the WSL environment, suggesting that WSL does not fully utilize multi-threading.

#c, #combination, #openmp, #parallel-computing, #subsets, #ubuntu, #wsl


‘Hello OpenMP’: My first OpenMP Program.

My first test of OpenMP after setting up using Visual Studio 2015.

The processor used is an Intel Core i7 4600HQ with 4 cores and 8 threads.


#include <stdio.h>
#include <omp.h>
int main()
 #pragma omp parallel
  int ID = omp_get_thread_num();
  printf("hello(%d)", ID);
  printf("world(%d)\n", ID);
  int a;
  scanf_s("%d", &a);


Test 1:


Test 2:


Test 3:


This illustrates the interleaving of thread execution.

#c, #openmp, #parallel-computing