K Length Subsets – Parallelized

Problem Statement: Given an array of integers of size N, generate all the subsets of size K.

Aim: To parallelize subset generation using OpenMP and compare with serial code.

Github code for both serial & parallel: K Length Subsets
Serial code is based on this java implementation of generating K length subsets.

Approach :

The main scope for parallelization is to ‘grow’ multiple parts of the final list simultaneously, as there is no dependency between elements.

Generation of  each subset is divided into individual tasks, shared across multiple threads. Each thread uses its own link list to store the subsets generated by it.
After all threads finish, their lists are combined to one.

#pragma omp parallel
{
	#pragma omp single nowait
	{
		subset(A,n,k,0,0,0,lists);
	}
}
#pragma omp taskwait

Here, #pragma omp single nowait is used as a master thread, that assign tasks to the other threads.

void subset(int A[], int s, int k, int start, int current, boolstr used, LinkList lists[])
{
	if (current == k)
	{
		int *tmp = calloc(k, sizeof(int));
		int f = 0;
		long tu = used;
		for (int i = 0; i < s&& f < k; i++, tu = tu >> 1)
		{
			if (tu & 1 == 1)
			{
				tmp[f] = A[i];
				f++;
			}

		}

		LL_add(&lists[omp_get_thread_num()], tmp);
		return;
	}

	if (start == s)
	{
		return;
	}

#pragma omp task
	{
		subset(A, s, k, start + 1, current + 1, used | (1 << start), lists);
	}
#pragma omp task
	{
		subset(A, s, k, start + 1, current, used, lists);
	}
}

#pragma omp task creates tasks that are distributed across available threads.

To keep track of elements in a subset, an unsigned long long is used as a boolean string. Hence maximum number of elements that can be used is limited by the number of bits in the Boolean string.

Tests:

Input: N=50, K=6
=> 15,890,700 subsets of length k

System Specs Serial Tests Parallel Tests
Clock Ticks Real User Sys Clock Ticks Real User Sys
AMD E1-2100 APU @ 1.00 GHz 1 15510996 15.910s 14.352s 1.322s 19783962 10.527s 18.307s 1.656s
2 Cores 2 Threads 2 15436168 15.721s 14.463s 1.140s 19548649 10.360s 18.325s 1.623s
Ubuntu 3 15327170 15.594s 14.277s 1.216s 19548649 10.837s 18.093s 1.652s
Averages:  15424778 15.742s 19627086  10.575
Intel(R) Core(TM) i7-4700HQ CPU @ 2.40 GHz 1 2990031 3.028s 2.877s 0.152s 8951671 1.290s 8.046s 0.950s
4 Cores 8 Threads 2 2999294 3.037s 2.826s 0.212s 8641548 1.196s 7.839s 0.845s
Ubuntu 3 3119834 3.160s 2.920s 0.240s 8935114 1.232s 7.997s 0.982s
Averages:  – 3036386 3.075s 8842777 1.239
Intel(R) Core(TM) i7-4700HQ CPU @ 2.40 GHz 1 3531250 3.551s 3.156s 0.391s 14843750 3.595s 5.328s 9.531s
4 Cores 8 Threads 2 3546875 3.571s 2.984s 0.578s 13906250 3.552s 5.516s 8.406s
Windows Subsystem for Linux (WSL) 3 3515625 3.575s 2.969s 0.563s 13796875 3.561s 5.094s 8.719s
 Averages:  –  3531250  3.566s  14182291  3.569

The tests show a real time improvement of almost 33% on 2 cores, 2 threads and 60% on 4 cores,8 threads due to parallelism, using Ubuntu Environment.

However, there is no improvement at all on the WSL environment, suggesting that WSL does not fully utilize multi-threading.

#c, #combination, #openmp, #parallel-computing, #subsets, #ubuntu, #wsl

#pragma