http://pastebin.com/YMS4ehRj
^ This is my implementation of parallel merge sort. Basically what I do is, For every split, the first half is handled by a thread whereas the second half is sequential (i.e.) say we have an array of 9 elements, [0..4] is handled by Thread 1, [0..1] is handled Thread 2, [5..6] is handled by thread 3 (Look at the source code for clarification).
Everything else stays the same, like Merging. But the problem is, this runs much slower than merge sort, even slower than normal bubble sort! And I mean for an array of 25000 int's. I'm not sure where the bottleneck is: Is it the mutex locking? Is it the merging?
Any ideas on how to make this faster?
You are creating a large number of threads, each of which then only does very little work. To sort 25000 ints you create about 12500 threads that spawn other threads and merge their results, and about 12500 threads that only sort two ints each.
The overhead from creating all those threads far outweighs the gains you get from parallel processing.
To avoid this, make sure that each thread has a reasonable amount of work to do. For example, if one thread finds that it only has to sort <10000 numbers it can simply sort them itself with a normal merge sort, instead of spawning new threads.
Given you have a finite number of cores on your system, why would you want to create more threads than cores?
Also, it isn't clear why you need to have a mutex at all. As far as I can tell from a quick scan, the program doesn't need to share the threads[lthreadcnt] outside the local function. Just use a local variable and you should be golden.
Your parallelism is too fine-grained, there are too many threads which are doing just small work. You can define a threshold so that arrays which have smaller sizes than the threshold are sequentially sorted. Be careful about the number of spawned threads, a good indication is that the number of threads are usually not much bigger than the number of cores.
Because much of your computation is in merge
function, another suggestion is using Divide-and-Conquer Merge instead of simple merge. The advantage is two-fold: the running time is smaller and it is easy to spawn threads for running parallel merging. You can get the idea of how to implement parallel merge here: http://drdobbs.com/high-performance-computing/229204454. They also have an article about Parallel Merge Sort which might be helpful for you: http://drdobbs.com/high-performance-computing/229400239