I know how to sum one __m256
to get a single summed value. However, I have 8 vectors like
Input
1: a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7],
.....,
.....,
8: h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7]
Output
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7],
....,
h[0]+h[1]+h[2]+h[3]+h[4]+h[5]+h[6]+h[7]
My method is below. I'm curious if there is a better way.
__m256 sumab = _mm256_hadd_ps(accumulator1, accumulator2);
__m256 sumcd = _mm256_hadd_ps(accumulator3, accumulator4);
__m256 sumef = _mm256_hadd_ps(accumulator5, accumulator6);
__m256 sumgh = _mm256_hadd_ps(accumulator7, accumulator8);
__m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);
__m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);
__m128 sumabcd1 = _mm256_extractf128_ps(sumabcd, 0);
__m128 sumabcd2 = _mm256_extractf128_ps(sumabcd, 1);
__m128 sumefgh1 = _mm256_extractf128_ps(sumefgh, 0);
__m128 sumefgh2 = _mm256_extractf128_ps(sumefgh, 1);
sumabcd1 = _mm_add_ps(sumabcd1, sumabcd2);
sumefgh1 = _mm_add_ps(sumefgh1, sumefgh2);
__m256 result = _mm256_insertf128_ps(_mm256_castps128_ps256(sumabcd1), sumefgh1, 1);

You can use 2x _mm256_permute2f128_ps to line up the low and high lanes for a vertical vaddps. This is instead of 2x extractf128 / insertf128, and it also turns two 128-bit vaddps xmm instructions into a single 256-bit vaddps ymm.

vperm2f128 is as fast as a single vextractf128 or vinsertf128 on Intel CPUs. It's slow on AMD, though (8 m-ops with 4c latency on Bulldozer-family). Still, it's not so bad that you need to avoid it, even if you care about perf on AMD. (And one of the permutes can actually be a vinsertf128.)
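
Something like this, keeping the accumulator1..accumulator8 names from the question (a sketch of the approach just described; the 0x20 and 0x31 immediates select the low lanes of both sources and the high lanes of both sources, respectively):

__m256 sumab = _mm256_hadd_ps(accumulator1, accumulator2);
__m256 sumcd = _mm256_hadd_ps(accumulator3, accumulator4);
__m256 sumef = _mm256_hadd_ps(accumulator5, accumulator6);
__m256 sumgh = _mm256_hadd_ps(accumulator7, accumulator8);
__m256 sumabcd = _mm256_hadd_ps(sumab, sumcd); // low lane: sums of a..d elements 0..3; high lane: elements 4..7
__m256 sumefgh = _mm256_hadd_ps(sumef, sumgh); // same layout for e..h
// Line up the two low lanes and the two high lanes, then do one vertical 256-bit add.
__m256 sum_lo = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x20); // low lane of each source
__m256 sum_hi = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x31); // high lane of each source
__m256 result = _mm256_add_ps(sum_lo, sum_hi); // { sumA, sumB, sumC, sumD, sumE, sumF, sumG, sumH }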

This compiles as you'd expect. The second permute2f128 actually compiles to a vinsertf128, since it's only using the low lane of each input in the same way that vinsertf128 does. gcc 4.7 and later do this optimization, but only much more recent clang versions do (v3.7). If you care about old clang, do this at the source level.

The savings in source lines is bigger than the savings in instructions, because _mm256_extractf128_ps(sumabcd, 0) compiles to zero instructions: it's just a cast. No compiler should ever emit vextractf128 with an imm8 other than 1 (vmovdqa xmm/m128, xmm is always better for getting the low lane). Nice job Intel on wasting an instruction byte on future-proofing that you couldn't use, because plain VEX prefixes don't have room to encode longer vectors.
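
In other words, splitting a __m256 into halves only costs an instruction for the high half (variable name reused from the code above, just for illustration):

__m128 lo = _mm256_castps256_ps128(sumabcd);   // low 128 bits: zero instructions, purely a cast
__m128 hi = _mm256_extractf128_ps(sumabcd, 1); // high 128 bits: one vextractf128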

The two vaddps xmm instructions could run in parallel, so using a single vaddps ymm is mostly just a throughput (and code-size) gain, not a latency gain.

We do shave off 3 cycles from completely eliminating the final vinsertf128, though.

vhaddps is 3 uops, 5c latency, and one per 2c throughput (6c latency on Skylake). Two of those three uops run on the shuffle port. I guess it's basically doing 2x shufps to generate operands for addps.
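
To spell that guess out (my illustration of the decomposition, not necessarily what the hardware actually does; x and y are placeholder inputs):

// Roughly what _mm256_hadd_ps(x, y) computes, independently in each 128-bit lane:
__m256 even = _mm256_shuffle_ps(x, y, _MM_SHUFFLE(2, 0, 2, 0)); // x0, x2, y0, y2
__m256 odd  = _mm256_shuffle_ps(x, y, _MM_SHUFFLE(3, 1, 3, 1)); // x1, x3, y1, y3
__m256 hadd = _mm256_add_ps(even, odd);                         // x0+x1, x2+x3, y0+y1, y2+y3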

If we can emulate haddps (or at least get a horizontal operation we can use) with a single shufps/addps or something, we'd come out ahead. Unfortunately, I don't see how: a single shuffle can only produce one result with data from two vectors, but we need both inputs to the vertical addps to have data from both vectors.

I don't think doing the horizontal sum another way looks promising. Normally, hadd is not a good choice, because the common horizontal-sum use case only cares about one element of its output. That's not the case here: every element of every hadd result is actually used.