I'm wondering if SSE/AVX operations such as addition and multiplication can be an atomic operation? The reason I ask this is that in OpenMP the atomic construct only works on a limited set of operators. It does not work on for example SSE/AVX additions.
Let's assume I had a datatype float4
that corresponds to a SSE register and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code:
float4 sum4 = 0.0f; //sets all four values to zero
#pragma omp parallel
{
float4 sum_private = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i+=4) {
float4 val = float4().load(&array[i]) //load four floats into a SSE register
sum_private4 += val; //sum_private4 = _mm_addps(val,sum_private4)
}
#pragma omp critical
sum4 += sum_private;
}
float sum = horizontal_sum(sum4); //sum4[0] + sum4[1] + sum4[2] + sum4[3]
But atomic is faster than critical in general and my instinct tells me SSE/AVX operations should be atomic (even if OpenMP does not support it). Is this a limitation of OpenMP? Could I use for example e.g. Intel Threading Building Blocks or pthreads to do this as an atomic operation?
Edit: Based on Jim Cownie's comment I created a new function which is the best solution. I verified that it gives the correct result.
float sum = 0.0f;
#pragma omp parallel reduction(+:sum)
{
Vec4f sum4 = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i+=4) {
Vec4f val = Vec4f().load(&A[i]); //load four floats into a SSE register
sum4 += val; //sum4 = _mm_addps(val,sum4)
}
sum += horizontal_add(sum4);
}
Edit: based on comments Jim Cownie and comments by Mystical at this thread OpenMP atomic _mm_add_pd I realize now that the reduction implementation in OpenMP does not necessarily use atomic operators and it's best to rely on OpenMP's reduction implementation rather than try to do it with atomic.