Static vs Dynamic Scheduling (OpenMP)

Solving problems from Project Euler is a great way to experiment with code. While solving problem 49 (https://projecteuler.net/problem=49) I decided to speed it up by parallelizing it. The problem input was too small to take any noticeable execution time, so I just increased the input size to get more number sequences. I will not be sharing the answer here.

The aim is to find arithmetic sequences of 3 numbers such that 1) each number is prime, and 2) each number is a permutation of the same set of digits, e.g. 1487, 4817, 8147.

The first step is finding all prime numbers using the Sieve of Eratosthenes.
The second is finding arithmetic progressions within them and checking whether the members are permutations of the same set of digits.

The second step can be parallelized by scanning through multiple chunks of prime numbers simultaneously. However, there is no guarantee that each thread gets the same workload. Static scheduling may not provide a significant improvement, since threads that finish their fixed share early simply sit idle at the barrier. This is where dynamic scheduling shines: iterations are handed out on demand, so a thread that finishes its chunk immediately picks up more work, keeping all cores busy and completing execution faster.

Testing the code on an AMD E1-2100 APU with 2 cores, searching within prime numbers from 1000 to 500000 (these patterns don’t exist below 1000).

[Timing results image: static vs dynamic scheduling]

The results clearly show the improvement of dynamic over static scheduling in this case.

Code: https://github.com/xSooDx/ProjectEulerCode/blob/master/_49.c

#c, #openmp, #parallel-computing, #project-euler


K Length Subsets – Parallelized

Problem Statement: Given an array of integers of size N, generate all the subsets of size K.

Aim: To parallelize subset generation using OpenMP and compare with serial code.

GitHub code for both serial & parallel: K Length Subsets
The serial code is based on a Java implementation of generating K length subsets.

Approach:

The main scope for parallelization is to ‘grow’ multiple parts of the final list simultaneously, as there is no dependency between elements.

The generation of each subset is divided into individual tasks, shared across multiple threads. Each thread uses its own linked list to store the subsets it generates.
After all threads finish, their lists are combined into one.

#pragma omp parallel
{
	#pragma omp single nowait
	{
		/* One thread seeds the recursion; the tasks it spawns
		   are executed by the whole team. */
		subset(A, n, k, 0, 0, 0, lists);
	}
	/* The implicit barrier at the end of the parallel region
	   waits for all outstanding tasks. */
}

Here, #pragma omp single nowait ensures that only one thread seeds the task tree; the tasks it creates are then executed by every thread in the team, and the nowait clause lets the other threads start picking up tasks immediately instead of waiting at the single construct.

void subset(int A[], int s, int k, int start, int current, boolstr used, LinkList lists[])
{
	if (current == k)
	{
		/* Base case: decode the bitmask into an actual subset */
		int *tmp = calloc(k, sizeof(int));
		int f = 0;
		boolstr tu = used; /* keep the full 64-bit width; a plain long may truncate it */
		for (int i = 0; i < s && f < k; i++, tu >>= 1)
		{
			if ((tu & 1) == 1) /* parentheses needed: == binds tighter than & */
			{
				tmp[f] = A[i];
				f++;
			}
		}

		/* Each thread appends only to its own list, so no locking is needed */
		LL_add(&lists[omp_get_thread_num()], tmp);
		return;
	}

	if (start == s)
	{
		return;
	}

	/* Branch 1: include A[start]; 1ULL avoids shift overflow once start reaches 31 */
#pragma omp task
	{
		subset(A, s, k, start + 1, current + 1, used | (1ULL << start), lists);
	}
	/* Branch 2: exclude A[start] */
#pragma omp task
	{
		subset(A, s, k, start + 1, current, used, lists);
	}
}

Each #pragma omp task packages the recursive call as a unit of work that any idle thread in the team can pick up and execute.

To keep track of the elements in a subset, an unsigned long long is used as a boolean string (a bitmask, one bit per array position). Hence the maximum number of elements is limited by the width of that string, typically 64 bits.

Tests:

Input: N=50, K=6
=> 15,890,700 subsets of length k

AMD E1-2100 APU @ 1.00 GHz, 2 cores / 2 threads, Ubuntu

| Run | Serial ticks | Real | User | Sys | Parallel ticks | Real | User | Sys |
|-----|--------------|---------|---------|--------|----------------|---------|---------|--------|
| 1 | 15510996 | 15.910s | 14.352s | 1.322s | 19783962 | 10.527s | 18.307s | 1.656s |
| 2 | 15436168 | 15.721s | 14.463s | 1.140s | 19548649 | 10.360s | 18.325s | 1.623s |
| 3 | 15327170 | 15.594s | 14.277s | 1.216s | 19548649 | 10.837s | 18.093s | 1.652s |
| Avg | 15424778 | 15.742s | | | 19627086 | 10.575s | | |

Intel(R) Core(TM) i7-4700HQ CPU @ 2.40 GHz, 4 cores / 8 threads, Ubuntu

| Run | Serial ticks | Real | User | Sys | Parallel ticks | Real | User | Sys |
|-----|--------------|--------|--------|--------|----------------|--------|--------|--------|
| 1 | 2990031 | 3.028s | 2.877s | 0.152s | 8951671 | 1.290s | 8.046s | 0.950s |
| 2 | 2999294 | 3.037s | 2.826s | 0.212s | 8641548 | 1.196s | 7.839s | 0.845s |
| 3 | 3119834 | 3.160s | 2.920s | 0.240s | 8935114 | 1.232s | 7.997s | 0.982s |
| Avg | 3036386 | 3.075s | | | 8842777 | 1.239s | | |

Intel(R) Core(TM) i7-4700HQ CPU @ 2.40 GHz, 4 cores / 8 threads, Windows Subsystem for Linux (WSL)

| Run | Serial ticks | Real | User | Sys | Parallel ticks | Real | User | Sys |
|-----|--------------|--------|--------|--------|----------------|--------|--------|--------|
| 1 | 3531250 | 3.551s | 3.156s | 0.391s | 14843750 | 3.595s | 5.328s | 9.531s |
| 2 | 3546875 | 3.571s | 2.984s | 0.578s | 13906250 | 3.552s | 5.516s | 8.406s |
| 3 | 3515625 | 3.575s | 2.969s | 0.563s | 13796875 | 3.561s | 5.094s | 8.719s |
| Avg | 3531250 | 3.566s | | | 14182291 | 3.569s | | |

The tests show a real-time improvement of almost 33% on 2 cores / 2 threads and 60% on 4 cores / 8 threads due to parallelism in the Ubuntu environment.

However, there is no real-time improvement at all in the WSL environment; the very large sys times in the parallel runs (8–9s) suggest that WSL's threading overhead cancels out the gains from parallelism.

#c, #combination, #openmp, #parallel-computing, #subsets, #ubuntu, #wsl

#pragma

‘Hello OpenMP’: My first OpenMP Program.

My first test of OpenMP after setting up using Visual Studio 2015.

The processor used is an Intel Core i7-4700HQ with 4 cores and 8 threads.

Code

#include <stdio.h>
#include <omp.h>

int main()
{
	#pragma omp parallel
	{
		int ID = omp_get_thread_num(); /* unique ID within the thread team */
		printf("hello(%d)", ID);
		printf("world(%d)\n", ID);
		int a;
		scanf_s("%d", &a); /* MSVC-specific; just keeps the console window open */
	}
}

Outputs:

Test 1:

hello(0)hello(3)world(3)
hello(5)hello(7)hello(6)world(6)
hello(4)world(4)
hello(2)world(2)
world(5)
world(0)
world(7)
hello(1)world(1)

Test 2:

hello(0)world(0)
hello(6)hello(3)hello(5)world(5)
hello(7)hello(1)world(1)
hello(2)world(2)
world(3)
world(7)
world(6)
hello(4)world(4)

Test 3:

hello(0)world(0)
hello(2)hello(4)hello(5)world(5)
world(2)
hello(1)hello(7)world(4)
world(7)
hello(6)world(6)
world(1)
hello(3)world(3)

This illustrates the interleaving of thread execution: the two printf calls are not atomic with respect to each other, so another thread's output can land between a thread's "hello" and its "world".

#c, #openmp, #parallel-computing