Atomic operators, SSE/AVX, and OpenMP

Posted 2019-04-13 12:08

Question:

I'm wondering if SSE/AVX operations such as addition and multiplication can be atomic operations. The reason I ask is that in OpenMP the atomic construct only works on a limited set of operators; it does not work on, for example, SSE/AVX additions.

Let's assume I had a datatype float4 that corresponds to an SSE register, and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could then do a reduction over an array with the following code:

float4 sum4 = 0.0f; //sets all four values to zero
#pragma omp parallel
{
    float4 sum_private = 0.0f;
    #pragma omp for nowait
    for(int i=0; i<N; i+=4) {
        float4 val = float4().load(&array[i]); //load four floats into an SSE register
        sum_private += val; //sum_private = _mm_add_ps(val, sum_private)
    }
    #pragma omp critical
    sum4 += sum_private;
}
float sum = horizontal_sum(sum4); //sum4[0] + sum4[1] + sum4[2] + sum4[3]

But atomic is faster than critical in general, and my instinct tells me SSE/AVX operations should be atomic (even if OpenMP does not support them). Is this a limitation of OpenMP? Could I use, for example, Intel Threading Building Blocks or pthreads to do this as an atomic operation?
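For contrast, here is a minimal sketch of the form the atomic construct does accept, a scalar update with one of the supported operators (the variable names simply mirror the code above):

float sum = 0.0f;
#pragma omp parallel
{
    float sum_private = 0.0f;
    //... fill sum_private with this thread's partial result ...
    #pragma omp atomic
    sum += sum_private; //allowed: a scalar += with a supported operator
}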

Edit: Based on Jim Cownie's comment, I created a new version that is the best solution. I verified that it gives the correct result.

float sum = 0.0f;
#pragma omp parallel reduction(+:sum)
{
    Vec4f sum4 = 0.0f;  
    #pragma omp for nowait
    for(int i=0; i<N; i+=4) {
    Vec4f val = Vec4f().load(&A[i]); //load four floats into an SSE register
        sum4 += val; //sum4 = _mm_add_ps(val, sum4)
    }
    sum += horizontal_add(sum4);
}

Edit: Based on comments by Jim Cownie and by Mystical in the thread OpenMP atomic _mm_add_pd, I now realize that OpenMP's reduction implementation does not necessarily use atomic operations, and it is best to rely on OpenMP's reduction implementation rather than trying to do it with atomic.
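For completeness, OpenMP 4.0 and later also allow a user-defined reduction over a vector type, so the same pattern can be expressed without a critical section at all. A rough sketch, assuming the Vec4f and horizontal_add from Agner Fog's vectorclass used above and N divisible by 4:

#include "vectorclass.h"

#pragma omp declare reduction(vec4sum : Vec4f : omp_out += omp_in) initializer(omp_priv = Vec4f(0.0f))

Vec4f sum4(0.0f);
#pragma omp parallel for reduction(vec4sum : sum4)
for(int i=0; i<N; i+=4) {
    sum4 += Vec4f().load(&A[i]); //load four floats and accumulate into the SIMD register
}
float sum = horizontal_add(sum4); //sum4[0] + sum4[1] + sum4[2] + sum4[3]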

Answer 1:

SSE and AVX operations are in general not atomic (but a multiword CAS would sure be sweet).

You can use the combinable class template in tbb or ppl for more general-purpose reductions and thread-local initializations; think of it as a synchronized hash table indexed by thread id. It works just fine with OpenMP and doesn't spin up any extra threads on its own.

You can find examples on the tbb site and on msdn.
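For instance, a rough sketch of a combinable-based sum (the function name sum_with_combinable is just for this example; A and N are as in the question):

#include <functional>
#include <tbb/combinable.h>

float sum_with_combinable(const float *A, int N) {
    tbb::combinable<float> partial([]{ return 0.0f; }); //each thread's slot starts at zero
    #pragma omp parallel for
    for(int i=0; i<N; i++) {
        partial.local() += A[i]; //updates this thread's private copy, no locking on the hot path
    }
    return partial.combine(std::plus<float>()); //merge all the thread-local partial sums
}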

Regarding the comment, consider this code:

x = x + 5

You should really think of it as the following, particularly when multiple threads are involved:

while( true ){
    oldValue = x;
    desiredValue = oldValue + 5;
    //this conditional is the atomic compare and swap
    if( x == oldValue ){
        x = desiredValue;
        break;
    }
}

Make sense?
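Written out in C++ with std::atomic, that retry loop looks like the sketch below (for a scalar only; as noted above this does not extend to a whole SSE/AVX register, and the function name add_five is just for illustration):

#include <atomic>

void add_five(std::atomic<int>& x) {
    int oldValue = x.load();
    //compare_exchange_weak is the atomic compare and swap: if x still equals
    //oldValue it stores oldValue + 5 and returns true; otherwise it reloads
    //oldValue with the current value of x and the loop tries again
    while(!x.compare_exchange_weak(oldValue, oldValue + 5)) {
    }
}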