Parallelizing TimSort

In my previous post, I talked about an issue I faced while implementing TimSort. That implementation was primarily a serial implementation with only minor use of parallelization while copying arrays here and there (due to inbuilt language features of Chapel). However, since Chapel is a ‘productive parallel programming language’, I decided to improve on the algorithm.

First, some background on TimSort. TimSort is a hybrid sorting algorithm that combines insertion and merge sort, while taking advantage of naturally occurring sorted sub arrays, known as runs, present in practical data. TimSort searches for such runs, and if they are lesser than a minimum length, binary insertion sort is used to increase the length of the run. The runs are then merged, with the help of a stack (to store runs) and two rules to decide which runs are to be merged; X>Y+Z and Y>Z (where X,Y,Z are the top 3 runs of the stack). If either of the two rules are violated, then Y is merged with the smaller of X or Z.

timsort-algorythm-3rd-screenshot

This is done to optimize merging, by selecting similar size runs. However these rules force the algorithm to be serial. So I threw them out the window, and parallelized the code by using a recursive call and perform merging of pairs of runs simultaneously.

proc _TimMergeRuns(Data:[?adom],runs:[?rdom] ?domtype, lo=rdom.low, hi=rdom.high, comparator:?rec=defaultComparator):domtype{
 if(lo==hi){
   return runs[lo];
 }
 var dom1,dom2: domtype;
 var m = (lo+hi)/2;
 cobegin with (ref dom1, ref dom2){
   dom1=_TimMergeRuns(Data,runs,lo,m, comparator=comparator);
   dom2=_TimMergeRuns(Data,runs,m+1,hi, comparator=comparator);
 }
 return _TimSortMerge(Data,dom1,dom2, comparator=comparator);
}

Here, runs is an array of  domains, which is chapels way of representing valid indices of an array.  Each domain in runs represents a sub array within the Data array that is already sorted. The algorithm recursively splits the array of runs by 2 until only 1 is left (which is used as the escape hatch), then merges the runs using _TimSortMerge(). cobegin is used for task parallelism, and runs each statement within it on a separate thread. Chapel uses “Task Intent” clauses, such as ref  to allow non-local variables to be modified in a parallel region.

The resulting performance after parallelism, with results of parallel merge and quick sorts to compare with:

Parallelized TimSort:
Time taken to sort 131072 bytes (16384 int(64)s)
quickSort (seconds): 0.1807
mergeSort (seconds): 0.92939
timSort (seconds): 0.560166

Original Serial TimSort:
Time taken to sort 131072 bytes (16384 int(64)s)
quickSort (seconds): 0.182228
mergeSort (seconds): 0.571098
timSort (seconds): 1.89295

Without the rules of TimSort for merging, it behaves very similar to merge sort, with the advantage of having larger runs to merge, and optimized merging of skewed runs (previous post, Galloping Phase) .

Code on Github.

#chapel, #parallel-computing, #sort, #timsort

Advertisements