Most efficient way to get a __m256 of horizontal s

I know how to sum one __m256 to get a single summed value. However, I have 8 vectors like Input

1: a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7],
.....,
.....,
8: h[0], h[1], h[2], h[3], h[4], a[5], a[6], a[7]

Output

a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7], 
 ...., 
h[0]+h[1]+h[2]+h[3]+h[4]+h[5]+h[6]+h[7]

My method. Curious if there is a better way.

            __m256 sumab = _mm256_hadd_ps(accumulator1, accumulator2);
            __m256 sumcd = _mm256_hadd_ps(accumulator3, accumulator4);

            __m256 sumef = _mm256_hadd_ps(accumulator5, accumulator6);
            __m256 sumgh = _mm256_hadd_ps(accumulator7, accumulator8);

            __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);
            __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);

            __m128 sumabcd1 = _mm256_extractf128_ps(sumabcd, 0);
            __m128 sumabcd2 = _mm256_extractf128_ps(sumabcd, 1);
            __m128 sumefgh1 = _mm256_extractf128_ps(sumefgh, 0);
            __m128 sumefgh2 = _mm256_extractf128_ps(sumefgh, 1);

            sumabcd1 = _mm_add_ps(sumabcd1, sumabcd2);
            sumefgh1 = _mm_add_ps(sumefgh1, sumefgh2);

 __m256 result =_mm256_insertf128_ps(_mm256_castps128_ps256(sumabcd1), sumefgh1, 1)

You can use 2x _mm256_permute2f128_ps to line up the low and high lanes for a vertical vaddps. This is instead of 2x extractf128 / insertf128. This also turns two 128b vaddps xmm instructions into a single 256b vaddps ymm.

vperm2f128 is as fast as a single vextractf128 or vinsertf128 on Intel CPUs. It's slow on AMD, though (8 m-ops with 4c latency on Bulldozer-family). Still, not so bad that you need to avoid it, even if you care about perf on AMD. (And one of the permutes can actually be a vinsertf128).

__m256 hsum8(__m256 a, __m256 b, __m256 c, __m256 d,
             __m256 e, __m256 f, __m256 g, __m256 h)
{
    // a = [ A7 A6 A5 A4 | A3 A2 A1 A0 ]
    __m256 sumab = _mm256_hadd_ps(a, b);
    __m256 sumcd = _mm256_hadd_ps(c, d);

    __m256 sumef = _mm256_hadd_ps(e, f);
    __m256 sumgh = _mm256_hadd_ps(g, h);

    __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);  // [ D7:4 ... A7:4 | D3:0 ... A3:0 ]
    __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);  // [ H7:4 ... E7:4 | H3:0 ... E3:0 ]

    __m256 sum_hi = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x31);  // [ H7:4 ... E7:4 | D7:4 ... A7:4 ]
    __m256 sum_lo = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x20);  // [ H3:0 ... E3:0 | D3:0 ... A3:0 ]

    __m256 result = _mm256_add_ps(sum_hi, sum_lo);
    return result;
}

This compiles as you'd expect. The second permute2f128 actually compiles to a vinsertf128, since it's only using the low lane of each input in the same way that vinsertf128 does. gcc 4.7 and later do this optimization, but only much more recent clang versions do (v3.7). If you care about old clang, do this at the source level.

The savings in source lines is bigger than the savings in instructions, because _mm256_extractf128_ps(sumabcd, 0); compiles to zero instructions: it's just a cast. No compiler should ever emit vextractf128 with an imm8 other than 1. (vmovdqa xmm/m128, xmm is always better for getting the low lane). Nice job Intel on wasting an instruction byte on future-proofing that you couldn't use because plain VEX prefixes don't have room to encode longer vectors.

The two vaddps xmm instructions could run in parallel, so using a single vaddps ymm is mostly just a throughput (and code size) gain, not latency.

We do shave off 3 cycles from completely eliminating the final vinsertf128, though.

vhaddps is 3 uops, 5c latency, and one per 2c throughput. (6c latency on Skylake). Two of those three uops run on the shuffle port. I guess it's basically doing 2x shufps to generate operands for addps.

If we can emulate haddps (or at least get a horizontal operation we can use) with a single shufps/addps or something, we'd come out ahead. Unfortunately, I don't see how. A single shuffle can only produce one result with data from two vectors, but we need both inputs to vertical addps to have data from both vectors.

I don't think doing the horizontal sum another way looks promising. Normally, hadd is not a good choice, because the common horizontal-sum use-case only cares about one element of its output. That's not the case here: every element of every hadd result is actually used.