I would like to horizontally sum the components of a __m256 vector using AVX instructions.
In SSE I could use
xmm = _mm_hadd_ps(xmm, xmm);
xmm = _mm_hadd_ps(xmm, xmm);
to get the result in the first component of the vector, but this does not carry over to the 256-bit version of the function (_mm256_hadd_ps), which adds within each 128-bit lane separately.
What is the best way to compute the horizontal sum of a __m256 vector?
This can be done with the following code:
but there might be a better solution.
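A minimal sketch of one straightforward approach, not necessarily the code this answer had in mind: apply _mm256_hadd_ps twice so that each 128-bit lane holds its own sum, then add the two lane sums. The names hsum256_hadd, hsum8_hadd and AVX_TARGET are hypothetical helpers introduced for this illustration.

```c
#include <immintrin.h>

/* Hypothetical helper macro: lets GCC/Clang build this without -mavx.
   With other compilers, enable AVX through the compiler options instead. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Horizontal sum of all 8 floats in v, using two per-lane hadds. */
AVX_TARGET
static inline float hsum256_hadd(__m256 v)
{
    /* Per lane: (a0+a1, a2+a3, a0+a1, a2+a3) */
    __m256 t = _mm256_hadd_ps(v, v);
    /* Per lane: every element now holds that lane's sum of four */
    t = _mm256_hadd_ps(t, t);
    /* Add the low-lane sum and the high-lane sum */
    __m128 lo = _mm256_castps256_ps128(t);
    __m128 hi = _mm256_extractf128_ps(t, 1);
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));
}

/* Convenience wrapper (hypothetical): sums 8 floats from unaligned memory. */
AVX_TARGET
float hsum8_hadd(const float *p)
{
    return hsum256_hadd(_mm256_loadu_ps(p));
}
```

Calling hsum8_hadd on an array holding 1 through 8 should give 36. The code has to run on a CPU that supports AVX.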
This version should be optimal for both Intel Sandy/Ivy Bridge and AMD Bulldozer:
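A sketch in the spirit of this claim, not necessarily the exact code the answer showed: reduce to 128 bits first by adding the high half to the low half, then finish with cheap SSE shuffles and adds instead of hadd. As far as I know, hadd decodes to several micro-ops on these Intel cores and Bulldozer splits 256-bit operations into two 128-bit halves, so narrowing early should help on both. The names hsum256_ps, hsum8 and AVX_TARGET are hypothetical helpers for this illustration.

```c
#include <immintrin.h>

/* Hypothetical helper macro: lets GCC/Clang build this without -mavx.
   With other compilers, enable AVX through the compiler options instead. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Horizontal sum of all 8 floats in v, narrowing to 128 bits first. */
AVX_TARGET
static inline float hsum256_ps(__m256 v)
{
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* (v4, v5, v6, v7) */
    __m128 lo = _mm256_castps256_ps128(v);     /* (v0, v1, v2, v3) */
    /* quad = (v0+v4, v1+v5, v2+v6, v3+v7) */
    __m128 quad = _mm_add_ps(lo, hi);
    /* Elements 0 and 1 of dual: quad0+quad2 and quad1+quad3 */
    __m128 dual = _mm_add_ps(quad, _mm_movehl_ps(quad, quad));
    /* Element 0 of one: dual0 + dual1, the full sum */
    __m128 one = _mm_add_ss(dual, _mm_shuffle_ps(dual, dual, 0x1));
    return _mm_cvtss_f32(one);
}

/* Convenience wrapper (hypothetical): sums 8 floats from unaligned memory. */
AVX_TARGET
float hsum8(const float *p)
{
    return hsum256_ps(_mm256_loadu_ps(p));
}
```

Note that the order of the additions differs from a plain left-to-right loop, so for general floating-point inputs the result can differ from a scalar sum in the last bits.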