I'm trying to use OpenMP to parallelize code that is already vectorized with intrinsics, but the problem is that I'm using one XMM register as an outside 'variable' that I accumulate into on every loop iteration. For now I'm using the shared clause:
__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];
#pragma omp parallel for shared(xmm0)
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ... result of some operations
    xmm0 = _mm_add_pd(xmm0, xmm7);
}
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];
because the atomic operation is not supported (in VS2010) for something like this:
__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ... result of some operations
    #pragma omp atomic
    xmm0 = _mm_add_pd(xmm0, xmm7);
}
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];
Does anyone know a clever work-around?
EDIT: I've also tried it using the Parallel Patterns Library just now:
__declspec(align(16)) double res[2];
combinable<__m128d> xmm0_comb([](){return _mm_setzero_pd();});
parallel_for(0, len, 1, [&xmm0_comb, ...](int i)
{
    __m128d xmm7 = ... result of some operations
    __m128d& xmm0 = xmm0_comb.local();
    xmm0 = _mm_add_pd(xmm0, xmm7);
});
__m128d xmm0 = xmm0_comb.combine([](__m128d a, __m128d b){return _mm_add_pd(a, b);});
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];
but the result was disappointing.
You're solving the problem the wrong way. You should be using a reduction instead of atomic operations.
This is a better approach:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < len; i++)
{
    __m128d xmm7; // = ... result of some operations
    // Collapse the vector to a "double".
    __declspec(align(16)) double res[2];
    _mm_store_pd(res, xmm7);
    // Add to the reduction variable.
    sum += res[0] + res[1];
}
double final_result = sum;
A reduction is essentially an operation that collapses ("reduces") everything to a single variable using an associative operation such as +.
If you're doing a reduction, always try to use an actual reduction approach. Don't try to cheat it with atomic operations or critical sections.
The reason is that atomic/critical-section approaches are inherently not scalable: they maintain a long critical-path data dependency. A proper reduction shortens this critical path to log(# of threads).
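To see why the critical path drops to log(# of threads), here is a minimal sketch (the names tree_combine and partial are mine, not from any code in this thread) of combining per-thread partial sums pairwise instead of chaining them one after another:

#include <emmintrin.h>

// Pairwise ("tree") combine of n per-thread partial sums.
// The longest chain of dependent additions has length ceil(log2(n)),
// whereas updating one shared accumulator (atomic/critical) forms a
// serial chain of n dependent additions.
static __m128d tree_combine(__m128d *partial, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            partial[i] = _mm_add_pd(partial[i], partial[i + stride]);
    return partial[0];
}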
The only downside, of course, is that it breaks floating-point associativity: the partial sums are combined in a different order, so the result can differ slightly from a sequential sum. If that matters, you're basically stuck summing each iteration sequentially.
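A quick standalone illustration of the associativity point (not related to the code above): regrouping a floating-point sum can change the result.

#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;
    // (a + b) + c evaluates to 1.0, but a + (b + c) evaluates to 0.0,
    // because 1.0 is lost when added to -1e16 first. A parallel
    // reduction regroups the sum, so it can show this kind of effect.
    printf("%g vs %g\n", (a + b) + c, a + (b + c));
    return 0;
}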
What you're looking for is a reduction. You can do that as an omp reduction if your compiler supports it (gcc does), or you can roll one yourself by summing into a private xmm for each thread. Below is a simple version doing both:
#include <emmintrin.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int NTHREADS = 8;
    const int len = 100;

    // One partial sum per thread for the hand-rolled reduction.
    __m128d xmm0[NTHREADS];
    // Accumulator for the OpenMP reduction clause.
    __m128d xmmreduction = _mm_setzero_pd();

    #pragma omp parallel for num_threads(NTHREADS)
    for (int i = 0; i < NTHREADS; i++)
        xmm0[i] = _mm_setzero_pd();

    __attribute__((aligned(16))) double res[2];

    #pragma omp parallel num_threads(NTHREADS) reduction(+:xmmreduction)
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < len; i++)
        {
            double d = (double)i;
            __m128d xmm7 = _mm_set_pd(d, 2. * d);
            // Hand-rolled: each thread sums into its own slot.
            xmm0[tid] = _mm_add_pd(xmm0[tid], xmm7);
            // OpenMP reduction on the __m128d itself (gcc accepts this).
            xmmreduction = _mm_add_pd(xmmreduction, xmm7);
        }
    }

    // Combine the per-thread partial sums.
    for (int i = 1; i < NTHREADS; i++)
        xmm0[0] = _mm_add_pd(xmm0[0], xmm0[i]);

    _mm_store_pd(res, xmm0[0]);
    double final_result = res[0] + res[1];

    printf("Expected result = %f\n", 3.0 * (len - 1) * len / 2);
    printf("Calculated result = %lf\n", final_result);

    _mm_store_pd(res, xmmreduction);
    final_result = res[0] + res[1];
    printf("Calculated result (reduction) = %lf\n", final_result);

    return 0;
}
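For reference, something along these lines builds and runs it with gcc (the file name xmm_reduction.c is just my placeholder):

gcc -std=gnu99 -O2 -msse2 -fopenmp xmm_reduction.c -o xmm_reduction
./xmm_reduction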
With great help from the people who answered my question, I've come up with this:
double final_result = 0.0;
#pragma omp parallel reduction(+:final_result)
{
    __declspec(align(16)) double r[2];
    __m128d xmm0 = _mm_setzero_pd();
    #pragma omp for
    for (int i = 0; i < len; i++)
    {
        __m128d xmm7 = ... result of some operations
        xmm0 = _mm_add_pd(xmm0, xmm7);
    }
    _mm_store_pd(r, xmm0);
    final_result += r[0] + r[1];
}
It basically separates the collapse from the reduction, and it performs very well.
Many thanks to all who have helped me!
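(Side note: VS2010 can't do this, but on a compiler with OpenMP 4.0 support a user-defined reduction should let you reduce the __m128d directly. A rough, untested sketch; the identifier sse_add is just a name I picked:)

// Declare how two __m128d partial results are combined and initialized.
#pragma omp declare reduction(sse_add : __m128d : omp_out = _mm_add_pd(omp_out, omp_in)) initializer(omp_priv = _mm_setzero_pd())

__m128d xmm0 = _mm_setzero_pd();
#pragma omp parallel for reduction(sse_add : xmm0)
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ... result of some operations
    xmm0 = _mm_add_pd(xmm0, xmm7);
}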
I guess you can't add your own intrinsics to the compiler, and the MS compilers decided to skip inline assembler. I'm not sure there is an easy solution at all.