I have the following code in my program and I want to accelerate it using OpenMP.
...
for (i = curr_index; i < curr_index + rx_size; i += 2) {
    int64_t tgt = rcvq[i];
    int64_t src = rcvq[i + 1];
    if (!TEST(tgt)) {
        pred[tgt] = src;
        newq[newq_count++] = tgt;
    }
}
Currently, I have a version as follows:
...
chunk = rx_size / omp_nthreads;
#pragma omp parallel for num_threads(omp_nthreads)
for (ii = 0; ii < omp_nthreads; ii++) {
    int start = curr_index + ii * chunk;
    for (index = start; index < start + chunk; index += 2) {
        int64_t tgt = rcvq[index];
        int64_t src = rcvq[index + 1];
        if (!TEST(tgt)) {
            pred[tgt] = src;
            #pragma omp critical
            newq[newq_count++] = tgt;
        }
    }
}
When I run the OpenMP version, I see a big performance degradation compared to the original version. I think the issue is the omp critical section, which prevents the threads from proceeding in parallel. I want to know what could be improved in my code so that I can get better performance than the serial version. In this code, rx_size is always a multiple of omp_nthreads.
I'm pretty sure the omp critical section is limiting your performance at this point.
I'd recommend collecting the results into separate buffers/vectors per thread and merging them after the parallel processing is done (of course, only if the order doesn't matter to you).
Yes, the critical section is limiting your performance. You should collect the results locally for each thread and then merge them.
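A sketch of that pattern in C (MAX_LOCAL is a placeholder for a per-thread upper bound on the number of results, and memcpy needs <string.h>; both are my additions, not names from your code):

#include <string.h>   /* memcpy */

size_t newq_count = 0;
#pragma omp parallel
{
    /* Private buffer: filled without any synchronization. */
    int64_t newq_local[MAX_LOCAL];
    size_t newq_count_local = 0;
    size_t my_begin;

    #pragma omp for
    for (size_t i = curr_index; i < curr_index + rx_size; i += 2) {
        int64_t tgt = rcvq[i];
        int64_t src = rcvq[i + 1];
        if (!TEST(tgt)) {
            pred[tgt] = src;
            newq_local[newq_count_local++] = tgt;
        }
    }

    /* Merge: reserve a region of newq with one atomic capture per thread,
       then copy the private results into it without further locking. */
    #pragma omp atomic capture
    { my_begin = newq_count; newq_count += newq_count_local; }
    memcpy(&newq[my_begin], newq_local, newq_count_local * sizeof(int64_t));
}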
With this approach, all threads work in parallel on the merging, and there is only minimal contention on the atomic capture. Note that you can also make a simple version with atomic capture, which is more efficient than the critical section, but will still become a bottleneck quickly:
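For example (a sketch; this goes inside the parallel loop, in place of the critical section):

size_t index;
#pragma omp atomic capture
index = newq_count++;   /* atomically reserve one slot in the shared queue */
newq[index] = tgt;      /* the write itself needs no further synchronization */

Every element still serializes on newq_count, which is why this variant stops scaling sooner than the buffered merge above.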
A few remarks:
- There is no guarantee about the order of elements within newq in any of the versions.
- Always declare variables as local as possible, especially when using OpenMP. The critical-version you posted is wrong, because index (defined in an outer scope) is implicitly shared among threads.
- All of this assumes that there are no duplicates within rcvq. Otherwise, you get a race condition on pred[tgt] = src;.
- Your approach of slicing the loop manually is unnecessarily complicated. There is no need for two loops; just use one pragma omp for loop, as in the sketch above.
The other answer gets the idea right. However, it is C++, not, as tagged, C. There is also a subtle yet significant performance issue with using std::vector<std::vector<>>. Usually a vector is implemented with three pointers, 24 bytes in total. Upon push_back, one of those pointers is incremented. This means that (a) the pointers of the local vectors from multiple threads reside on the same cache line, and (b) on every successful TEST, push_back reads and writes a cache line that is used by other threads. This cache line has to move around between cores all the time, greatly limiting the scalability of this approach. This is called false sharing.
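To make the layout concrete, here is a hypothetical C sketch (the structs, names, and the 64-byte cache-line size are illustrative assumptions, not code from either answer):

#include <stdint.h>
#include <stddef.h>

/* Analogue of one std::vector control block per thread. */
typedef struct {
    int64_t *data;
    size_t   count;   /* bumped on every append, like push_back */
} buf_t;

/* Prone to false sharing: several 16-byte headers fit into one 64-byte
   cache line, so an append by one thread invalidates its neighbours' line. */
buf_t bufs_packed[64];

/* One common remedy: pad each header out to a full cache line. */
typedef struct {
    int64_t *data;
    size_t   count;
    char     pad[64 - sizeof(int64_t *) - sizeof(size_t)];
} padded_buf_t;

padded_buf_t bufs_padded[64];

The C version above avoids the problem altogether, since each thread's buffer and counter live in its own stack frame.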
I implemented a small test based on the other answer, giving the following performance:

- 0.99 s - single thread
- 1.58 s - two threads on two neighbouring cores of the same socket
- 2.13 s - two threads on two cores of different sockets
- 0.99 s - two threads sharing a single core
- 0.62 s - 24 threads on two sockets

Whereas the C version above scales much better:

- 0.46 s - single thread (not really comparable C vs C++)
- 0.24 s - two threads
- 0.04 s - 24 threads