I want to add the four components of an SSE register to get a single float. This is how I do it now:
float a[4];
_mm_storeu_ps(a, foo128);
float x = a[0] + a[1] + a[2] + a[3];
Is there an SSE instruction that directly achieves this?
You could use the SSE3 instruction HADDPS, or its compiler intrinsic _mm_hadd_ps.
For example, see http://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.80).aspx
If you have two registers v1 and v2 :
v = _mm_hadd_ps(v1, v2);
v = _mm_hadd_ps(v, v);
Now, v[0] contains the sum of v1's components, and v[1] contains the sum of v2's components.
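As a minimal sketch of the two-step HADDPS reduction described above: the function names hsum2_ps_sse3 and hsum_ps_sse3 and the TARGET_SSE3 macro are made up for illustration. The target attribute is a GCC/Clang-specific assumption so that the code builds without -msse3; with other compilers the macro expands to nothing.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics (_mm_hadd_ps); also pulls in SSE/SSE2 */

/* Assumption: GCC/Clang need SSE3 enabled per function when -msse3 is not given. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE3 __attribute__((target("sse3")))
#else
#define TARGET_SSE3
#endif

/* Returns a vector whose lane 0 is the sum of v1's lanes
   and whose lane 1 is the sum of v2's lanes. */
TARGET_SSE3 static __m128 hsum2_ps_sse3(__m128 v1, __m128 v2)
{
    /* [v1[0]+v1[1], v1[2]+v1[3], v2[0]+v2[1], v2[2]+v2[3]] */
    __m128 v = _mm_hadd_ps(v1, v2);
    /* [sum(v1), sum(v2), sum(v1), sum(v2)] */
    return _mm_hadd_ps(v, v);
}

/* Single-register case: reduce v against itself and read lane 0. */
TARGET_SSE3 static float hsum_ps_sse3(__m128 v)
{
    return _mm_cvtss_f32(hsum2_ps_sse3(v, v));
}
```

For the case in the question this would be used as float x = hsum_ps_sse3(foo128);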
If you want your code to work on pre-SSE3 CPUs (which do not support _mm_hadd_ps), you might use the following code. It uses more instructions, but decodes to fewer micro-ops on most CPUs.
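// temp[0] = foo128[0] + foo128[2], temp[1] = foo128[1] + foo128[3]
__m128 temp = _mm_add_ps(_mm_movehl_ps(foo128, foo128), foo128);
float x;
// _mm_shuffle_ps takes two source vectors plus a selector; selector 1 moves temp[1] into lane 0
_mm_store_ss(&x, _mm_add_ss(temp, _mm_shuffle_ps(temp, temp, 1)));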
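The same SSE1-only reduction can be wrapped in a small helper, shown here as a sketch. The name hsum_ps_sse1 is hypothetical, and _mm_cvtss_f32 is used instead of _mm_store_ss only as a stylistic choice to return the value directly.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Horizontal sum of the four lanes of v using only SSE1 instructions. */
static float hsum_ps_sse1(__m128 v)
{
    /* movehl copies the upper two lanes (v[2], v[3]) into the lower two. */
    __m128 hi   = _mm_movehl_ps(v, v);
    /* sums[0] = v[0] + v[2], sums[1] = v[1] + v[3] */
    __m128 sums = _mm_add_ps(v, hi);
    /* Selector 1 places sums[1] in lane 0 of the shuffled result. */
    __m128 odd  = _mm_shuffle_ps(sums, sums, 1);
    /* Scalar add in lane 0, then extract that lane as a float. */
    return _mm_cvtss_f32(_mm_add_ss(sums, odd));
}
```

With this helper, the question's code would become float x = hsum_ps_sse1(foo128); Note that the lanes are added in the order (v[0] + v[2]) + (v[1] + v[3]), so the rounding can differ slightly from a[0] + a[1] + a[2] + a[3].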
Well, I don't know of any single instruction that does this, but it can be done by applying _mm_hadd_ps() twice.