How can I sum the elements of a float vector (a reduction) using SSE intrinsics?
Simple serial code:
// Serial reference; the function name sum is assumed.
void sum(float *input, float &result, unsigned int NumElems)
{
    result = 0;
    for (unsigned int i = 0; i < NumElems; ++i)
        result += input[i];
}
Typically you accumulate 4 partial sums in your loop (one per lane of an SSE register) and then sum horizontally across the 4 elements after the loop.
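The approach described above might be sketched as follows. This is only an illustration under stated assumptions: the function name `sum_sse` is hypothetical, `a` is assumed to be 16-byte aligned, and `n` is assumed to be a multiple of 4. The horizontal sum uses SSE1 shuffles; other ways of doing that last step exist.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: sums the n floats starting at a.
// Assumes a is 16-byte aligned and n is a multiple of 4.
float sum_sse(const float *a, unsigned int n)
{
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);

    __m128 vsum = _mm_setzero_ps();            // 4 partial sums, all zero
    for (unsigned int i = 0; i < n; i += 4)
    {
        __m128 v = _mm_load_ps(&a[i]);         // aligned load of 4 floats
        vsum = _mm_add_ps(vsum, v);            // lane-wise accumulate
    }

    // Horizontal sum of the 4 lanes (v0, v1, v2, v3).
    __m128 shuf = _mm_shuffle_ps(vsum, vsum, _MM_SHUFFLE(2, 3, 0, 1)); // (v1, v0, v3, v2)
    __m128 sums = _mm_add_ps(vsum, shuf);      // (v0+v1, v0+v1, v2+v3, v2+v3)
    shuf = _mm_movehl_ps(shuf, sums);          // lane 0 now holds v2+v3
    sums = _mm_add_ss(sums, shuf);             // lane 0 = v0+v1+v2+v3
    return _mm_cvtss_f32(sums);                // extract lane 0
}
```

Because the additions happen in a different order than in the serial loop, the result can differ slightly from the serial sum due to floating-point rounding.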
Note: with aligned loads, `a` (the input array) must be 16-byte aligned and `n` (the element count) must be a multiple of 4. If the alignment of `a` cannot be guaranteed, use `_mm_loadu_ps` instead of `_mm_load_ps`. If `n` is not guaranteed to be a multiple of 4, add a scalar loop at the end of the function to accumulate any remaining elements.
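When neither condition in the note above can be guaranteed, a variant along these lines may help. It is a sketch rather than a tuned implementation: `sum_sse_any` is a hypothetical name, it uses `_mm_loadu_ps` for the vector part, and it finishes with a scalar loop for the leftover elements. For simplicity the horizontal sum here just stores the 4 partial sums to a small array.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical variant: no alignment requirement on a, any n.
float sum_sse_any(const float *a, unsigned int n)
{
    const unsigned int nvec = n - (n % 4);     // largest multiple of 4 <= n

    __m128 vsum = _mm_setzero_ps();            // 4 partial sums
    for (unsigned int i = 0; i < nvec; i += 4)
        vsum = _mm_add_ps(vsum, _mm_loadu_ps(&a[i]));  // unaligned load

    // Horizontal sum: spill the 4 lanes and add them in scalar code.
    float partial[4];
    _mm_storeu_ps(partial, vsum);
    float result = (partial[0] + partial[1]) + (partial[2] + partial[3]);

    // Scalar tail: accumulate the remaining n % 4 elements.
    for (unsigned int i = nvec; i < n; ++i)
        result += a[i];

    return result;
}
```

Unaligned loads can be slower than aligned ones on some older CPUs, so the aligned version is worth keeping where the alignment of `a` is known.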