Parallelizing TimSort

In my previous post, I talked about an issue I faced while implementing TimSort. That implementation was primarily a serial implementation with only minor use of parallelization while copying arrays here and there (due to inbuilt language features of Chapel). However, since Chapel is a ‘productive parallel programming language’, I decided to improve on the algorithm.

First, some background on TimSort. TimSort is a hybrid sorting algorithm that combines insertion and merge sort, while taking advantage of naturally occurring sorted sub arrays, known as runs, present in practical data. TimSort searches for such runs, and if they are lesser than a minimum length, binary insertion sort is used to increase the length of the run. The runs are then merged, with the help of a stack (to store runs) and two rules to decide which runs are to be merged; X>Y+Z and Y>Z (where X,Y,Z are the top 3 runs of the stack). If either of the two rules are violated, then Y is merged with the smaller of X or Z.

timsort-algorythm-3rd-screenshot

This is done to optimize merging, by selecting similar size runs. However these rules force the algorithm to be serial. So I threw them out the window, and parallelized the code by using a recursive call and perform merging of pairs of runs simultaneously.

proc _TimMergeRuns(Data:[?adom],runs:[?rdom] ?domtype, lo=rdom.low, hi=rdom.high, comparator:?rec=defaultComparator):domtype{
 if(lo==hi){
   return runs[lo];
 }
 var dom1,dom2: domtype;
 var m = (lo+hi)/2;
 cobegin with (ref dom1, ref dom2){
   dom1=_TimMergeRuns(Data,runs,lo,m, comparator=comparator);
   dom2=_TimMergeRuns(Data,runs,m+1,hi, comparator=comparator);
 }
 return _TimSortMerge(Data,dom1,dom2, comparator=comparator);
}

Here, runs is an array of  domains, which is chapels way of representing valid indices of an array.  Each domain in runs represents a sub array within the Data array that is already sorted. The algorithm recursively splits the array of runs by 2 until only 1 is left (which is used as the escape hatch), then merges the runs using _TimSortMerge(). cobegin is used for task parallelism, and runs each statement within it on a separate thread. Chapel uses “Task Intent” clauses, such as ref  to allow non-local variables to be modified in a parallel region.

The resulting performance after parallelism, with results of parallel merge and quick sorts to compare with:

Parallelized TimSort:
Time taken to sort 131072 bytes (16384 int(64)s)
quickSort (seconds): 0.1807
mergeSort (seconds): 0.92939
timSort (seconds): 0.560166

Original Serial TimSort:
Time taken to sort 131072 bytes (16384 int(64)s)
quickSort (seconds): 0.182228
mergeSort (seconds): 0.571098
timSort (seconds): 1.89295

Without the rules of TimSort for merging, it behaves very similar to merge sort, with the advantage of having larger runs to merge, and optimized merging of skewed runs (previous post, Galloping Phase) .

Code on Github.

#chapel, #parallel-computing, #sort, #timsort

Advertisements

Shifting a Sub Array in Parallel

Say you have a very large array of elements, and you want to shift a large sub array of it from position a..b to c..d. This sounds simple enough doesn’t it? Just shift the elements one by one in serial, or several chunks in parallel.

Well it is… as long as you are doing it in serial, and mostly while doing it in parallel as well. However the is an easy to miss issue while parallelizing this. Everything goes perfectly fine unless the sub ranges a..b and c..d share a common region.

array.png

The issue arises due to a race condition in the ‘danger zone’ where elements from the beginning of the sub array can overwrite the elements in the danger zone before they are shifted, corrupting the data.

This may seem like a trivial issue, but it is exactly what I faced while trying to implement the TimSort algorithm using the Chapel Language. The issue only showed up while dealing with huge array sizes, making it agonizing to debug.

TimSort is a hybrid sort combining several sorting techniques like insertion and merge sort. Within, it has a phase known as Galloping Phase where, when initiated while merging two sub arrays, the algorithm starts to move larger chunks of elements per comparison.

in Chapel, the syntax of an array copy is:

array1[a..b]=array2[c..d];

Since Chapel is a parallel programming language, this copy is done in parallel. Hence, if array1 and array2 are the same, and a..b overlaps with c..d, we have a problem.

Solutions:

  1. Make the copy in serial. In the end correctness > performance.
  2. Copy the entire array into a temporary and then to the new location in parallel. But then there would be the cost of synchronizing the threads after copying into the temp.
  3. Copy only the overlapping region in serial, and then the rest in parallel.

In the latter 2 cases, the cost of thread creation should also be taken into account. The size of the sub array to be shifted may not be big enough to pay off. In the end, due to this reason, I chose to keep the shifting to be done in serial.

Code on GitHub

#arrays, #chapel

Chapel

Chapel is an open source programming language designed  for productive parallel computing on large scale systems. It’s development is being led by Cray Inc.

Chapel is designed to run on high-end super computers, but also, due to its portability, runs on mulitcore desktops and laptops, clusters, and the cloud. Chapel supports multithreaded execution using abstractions for data and task parallelism, concurrency and nested parallelism. Chapel also supports object-oriented programming, type inference and generic programing.

To me, coding in chapel feels like a cross between Python and C++. It does not require a main() function, like Python, allowing for rapid prototyping. However for production grade code it uses a more structured programming style by declaring modules and using a main() function as the entry point.

Chapel supports data-parallel execution using forall loops which behaves like OpenMP’s #pragma omp parallel for. Task-parallelism is achieved using coforall loops which create a new thread for every iteration. This is usefull for when each iteration does large amounts of work. It also uses begin and  cobegin to perform tasks in parallel, similar to OpenMP’s tasks and sections.

Chapel is designed to be easy to learn for users of most other languages like C, C++, Java Python, Matlab, etc. It builds on concepts and syntax from other programming languages.

#chapel, #parallel-computing