I have tried to write a parallel implementation of mergesort using threads and templates. The relevant code is listed below.
I have compared the performance with std::sort from the C++ standard library. My code is 6 times slower than std::sort when no threads are spawned. Playing with the variable maxthreads (and/or FACTOR), I was only able to double the performance, so in the best case I am still 3 times slower than std::sort. I have tried the code on a 16-core multiprocessor machine.
htop shows that the cores are used as expected, but why the lack of performance? Why do I not see the parallelism in the overall runtime?
Is there an error?
Thank you for a reply.
#include <algorithm>
#include <iterator>
#include <mutex>
#include <thread>
#include <vector>

#define FACTOR 1
static unsigned int maxthreads = FACTOR * std::thread::hardware_concurrency();

unsigned int workers = 0;   // number of threads currently spawned
std::mutex g_mutex;         // protects workers

template <typename T>
std::vector<T>* mergesort_inplace_multithreading(
    typename std::vector<T>::iterator* listbegin,
    typename std::vector<T>::iterator* listend,
    std::vector<T>* listarg)
{
    if (*listbegin == *listend)
    {
        return listarg; // empty range
    }
    else if (*listend == *listbegin + 1)
    {
        return listarg; // single element
    }
    else
    {
        size_t offset = std::distance(*listbegin, *listend) / 2;
        typename std::vector<T>::iterator listhalf = *listbegin + offset;

        g_mutex.lock();
        if (maxthreads >= 2 and ::workers <= maxthreads - 2)
        {
            workers += 2;
            g_mutex.unlock();
            // sort the two halves in two new threads
            std::thread first_thread(mergesort_inplace_multithreading<T>, listbegin, &listhalf, listarg);
            std::thread second_thread(mergesort_inplace_multithreading<T>, &listhalf, listend, listarg);
            first_thread.join();
            second_thread.join();
            g_mutex.lock();
            workers -= 2;
            g_mutex.unlock();
        }
        else
        {
            g_mutex.unlock();
            // thread limit reached: recurse on the current thread
            mergesort_inplace_multithreading<T>(listbegin, &listhalf, listarg);
            mergesort_inplace_multithreading<T>(&listhalf, listend, listarg);
        }

        // merge the two sorted halves into a temporary buffer
        std::vector<T> result;
        typename std::vector<T>::iterator lo_sorted_it = *listbegin;
        typename std::vector<T>::iterator hi_sorted_it = listhalf;
        typename std::vector<T>::iterator lo_sortedend = listhalf;
        typename std::vector<T>::iterator hi_sortedend = *listend;

        while (lo_sorted_it != lo_sortedend and hi_sorted_it != hi_sortedend)
        {
            if (*lo_sorted_it <= *hi_sorted_it)
            {
                result.push_back(*lo_sorted_it);
                ++lo_sorted_it;
            }
            else
            {
                result.push_back(*hi_sorted_it);
                ++hi_sorted_it;
            }
        }//end while

        // append whichever side still has elements left
        if (lo_sorted_it != lo_sortedend)
        {
            //assert(hi_sorted_it == hi_sortedend);
            result.insert(result.end(), lo_sorted_it, lo_sortedend);
        }
        else
        {
            //assert(lo_sorted_it == lo_sortedend);
            result.insert(result.end(), hi_sorted_it, hi_sortedend);
        }

        // copy the merged result back into the original range
        std::copy(result.begin(), result.end(), *listbegin);
        return listarg;
    }
}
int main()
{
    //some tests
}
You don't need a mutex for parallel mergesort. And you certainly don't need to launch two threads for each partition split. Launch one thread and handle the second partition on the current thread; that is a much better use of thread resources than having one thread sit idle doing nothing but waiting for two others to finish.
First, the simple test program, sorting 20 million unsigned integers. Note: all programs were compiled with Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn), 64-bit, POSIX threads, and optimization level O2.
Test Program
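(The answer's original listing did not survive extraction. Below is a minimal sketch of the kind of harness described, under my own assumptions: the helper name time_sort, the fixed seed, and the lambda wrappers are hypothetical; the 20-million-element unsigned data set is from the answer.)

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Hypothetical harness: hand an identical copy of the data to each
// candidate sort and report wall-clock time plus a correctness check.
template <typename Sorter>
void time_sort(const char* name, const std::vector<unsigned int>& data, Sorter sorter)
{
    std::vector<unsigned int> v(data); // fresh copy: every candidate sorts the same input
    auto t0 = std::chrono::steady_clock::now();
    sorter(v.begin(), v.end());
    auto t1 = std::chrono::steady_clock::now();
    std::cout << name << ": "
              << std::chrono::duration<double>(t1 - t0).count() << " s, sorted: "
              << std::boolalpha << std::is_sorted(v.begin(), v.end()) << '\n';
}

int main()
{
    std::mt19937 rng(42); // fixed seed so every run sorts the same sequence
    std::vector<unsigned int> data(20000000);
    for (auto& x : data)
        x = rng();

    typedef std::vector<unsigned int>::iterator Iter;
    time_sort("std::sort", data, [](Iter b, Iter e) { std::sort(b, e); });
    // Each parallel variant below is added here the same way, e.g.
    // time_sort("mergesort_mt1", data, [](Iter b, Iter e) { mergesort_mt1(b, e); });
}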
Parallel Mergesort
We start with something super-basic. We cap the number of concurrent threads at the hardware concurrency reported by the standard library. Once we reach that limit, we stop issuing new threads and simply recurse on the current one. This trivial algorithm behaves surprisingly well once the work is distributed across hardware-supported threads.
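(The code listing is missing here; the following is a hedged sketch of the approach just described. The function name mergesort_mt1, the use of std::async, and the budget-halving scheme for the thread-count parameter N are my assumptions; the cap at hardware concurrency is from the answer.)

#include <algorithm>
#include <future>
#include <iterator>
#include <thread>

// Basic parallel mergesort: spawn one async task per split until the
// thread budget N is exhausted, then recurse serially on the current
// thread. std::inplace_merge stitches the halves back together.
template <typename Iterator>
void mergesort_mt1(Iterator begin, Iterator end,
                   unsigned int N = std::thread::hardware_concurrency())
{
    auto len = std::distance(begin, end);
    if (len < 2)
        return;

    Iterator mid = std::next(begin, len / 2);
    if (N > 1)
    {
        // One half on a new thread, the other half on the current thread.
        auto fut = std::async(std::launch::async,
                              mergesort_mt1<Iterator>, begin, mid, N / 2);
        mergesort_mt1(mid, end, N / 2);
        fut.wait();
    }
    else
    {
        mergesort_mt1(begin, mid, 0);
        mergesort_mt1(mid, end, 0);
    }

    std::inplace_merge(begin, mid, end);
}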
Output
This looks hopeful, but it can certainly be improved.
Parallel Merge + Standard Library Sort
The std::sort algorithm varies widely from vendor to vendor in its implementation. The main restriction placed by the standard is that it must have average complexity O(N log N). To achieve this with high performance, many std::sort implementations are some of the most complex, nutcase-optimized code you'll find in the standard library. I've perused some implementations that combine several internal sort strategies. One such implementation uses introsort (quicksort until the recursion depth is exhausted, then heapsort) for larger partitions and, once small partitions are reached, succumbs to a mammoth hand-unrolled 16-slot insertion sort.
The point is, the standard library authors understand that one universal sort algorithm simply does not fit all. Several are often employed for the task, often working together in harmony. Don't naively think you can beat them; rather, join them by exploiting their hard work.
Modifying our code is straightforward: we use std::sort for all partitions smaller than 1025 elements. The rest is identical:
Output
OK. Now we're seeing some impressive stuff. But can we squeeze even more out?
Parallel Merge + Standard Sort w/ Limited Recursion
If you think about it, to fully exploit all the hard work that went into std::sort, we can simply stop recursing once we reach full thread population. When that happens, we just sort whatever partition we have with std::sort and merge things together when done. Hard as it is to believe, this actually reduces the code complexity. Our algorithm becomes one of simply spreading partitions across cores, each handled by std::sort when the time comes:
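(Once more, a hedged sketch in place of the lost listing; the name mergesort_mt3 follows the pattern of the earlier names and is my assumption, as are std::async and the budget halving.)

// Limited recursion: once the thread budget is spent, or the partition
// is small, delegate the whole partition to std::sort; otherwise split
// once more and give one half to a new task.
template <typename Iterator>
void mergesort_mt3(Iterator begin, Iterator end,
                   unsigned int N = std::thread::hardware_concurrency())
{
    auto len = std::distance(begin, end);
    if (len <= 1024 || N < 2)
    {
        std::sort(begin, end);
        return;
    }

    Iterator mid = std::next(begin, len / 2);
    auto fut = std::async(std::launch::async,
                          mergesort_mt3<Iterator>, begin, mid, N / 2);
    mergesort_mt3(mid, end, N / 2);
    fut.wait();

    std::inplace_merge(begin, mid, end);
}

Once again, after adding this to our test loop...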
Output
As written, for any partition that is 1024 items or smaller, we just delegate to std::sort. If the partition is larger, we introduce a new thread to handle one side of the split, using the current thread to handle the other. Once we saturate the thread limit N, we stop splitting and simply delegate everything to std::sort no matter what. In short, we're a multi-threaded distribution front-end for std::sort.
Summary
There are still more bullets in the chamber that we could fire (using some meta-programming and assuming a fixed concurrency pool size), but that I leave to you.
You can dramatically increase your sorting performance if you simply focus on partitioning, distribute to threads until they're tapped out, utilize a highly optimized sorting algorithm for the floor partitions, then stitch things together to finish the job. Is there still room for improvement? Certainly. But in its simplest form, as presented above, there is no locking, no mutexes, etc. The difference between the final sample and bare std::sort is a whopping 58% improvement on identical data sets, on a puny little MacBook Air mid-2011 with 4 GB RAM and a dual-core i7 processor. This is impressive, and considering how little code it took, just plain f'ing awesome.

Thank you for the reply.
The mutex only protects the unsigned int workers (a global variable) that keeps track of how many threads have been spawned. If the maximum (given by maxthreads) is reached, no more threads are generated. You achieve this with the parameter N in mergesort_mt2.
How many cores does your machine have?
Still, the performance only seems to double...