Complex data reorganization with vector instructions


Question:

I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below:

ABC DEF GHI JKL

becomes

ABBC DEEF GHHI JKKL

Can you suggest efficient ways to achieve this using the SSE(2) and/or AVX(2) instructions?

This needs to be performed repeatedly, so pre-stored masks or constants are allowed.
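
For reference, here's a plain scalar version of the transform (a minimal sketch of the behaviour I want, assuming src_bytes is a multiple of 3; the name is just for illustration):

#include <stddef.h>

// every 3-byte group ABC becomes the 4-byte group ABBC
static void expand_scalar(char *dst, const char *src, size_t src_bytes) {
  for (size_t i = 0; i < src_bytes; i += 3) {
    *dst++ = src[i + 0];   // A
    *dst++ = src[i + 1];   // B
    *dst++ = src[i + 1];   // B duplicated
    *dst++ = src[i + 2];   // C
  }
}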

Answer 1:

By far your best bet is to use a byte shuffle (pshufb). Shifting within elements isn't enough by itself, since JKL has to move farther to the right than DEF, etc. etc. So you'd need multiple instructions to do different shifts and blend the results.

pshufb (_mm_shuffle_epi8) requires SSSE3, but can do the 12B->16B job in a single fast instruction. It uses a vector as a shuffle control mask. It's the first variable-control shuffle, as well as the first flexible byte shuffle. (SSE2 shuffles all use imm8 control operands or have fixed data movement, e.g. punpcklbw.)

It should be pretty easy to write a loop that loads 16B, shuffles the first 12B into 16B, then stores. Use unaligned loads, and if necessary unaligned stores. Instead of a scalar cleanup loop to handle the last few bytes, load the last 16B of input into a vector and shuffle the last 12B of that. Storing that will overlap with the last store in the loop if the array wasn't a multiple of 12B, but that's fine.
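
A rough SSSE3 intrinsics sketch of that loop (untested; assumes src_bytes is a multiple of 12 and at least 24 so the overlapping tail load stays inside the buffer, and the function name is just for illustration):

#include <immintrin.h>   // SSSE3: _mm_shuffle_epi8
#include <stddef.h>

void expand12to16_ssse3(char *restrict dst, const char *restrict src, size_t src_bytes) {
  // main mask expands the low 12B of a 16B load; tail mask expands the high 12B
  const __m128i shuf      = _mm_setr_epi8(0,1,1,2, 3,4,4,5,  6,7,7,8,    9,10,10,11);
  const __m128i shuf_last = _mm_setr_epi8(4,5,5,6, 7,8,8,9, 10,11,11,12, 13,14,14,15);
  const char *src_end = src + src_bytes;
  char *dst_last = dst + (src_bytes / 12) * 16 - 16;   // where the final 16B of output belongs

  for ( ; src + 16 <= src_end ; src += 12, dst += 16) {
    __m128i in = _mm_loadu_si128((const __m128i*)src);           // 16B load, only the low 12B are used
    _mm_storeu_si128((__m128i*)dst, _mm_shuffle_epi8(in, shuf));
  }
  // instead of a scalar cleanup loop: load the last 16B of input and expand its high 12B
  __m128i in = _mm_loadu_si128((const __m128i*)(src_end - 16));
  _mm_storeu_si128((__m128i*)dst_last, _mm_shuffle_epi8(in, shuf_last));
}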


A 128b loop should be able to sustain 16B of output per clock, if the input and output are hot in L1 cache. It will probably take some unrolling to make that happen, i.e. something like:

# shuffle mask in xmm5
# rsi=src, rdi=dst,  rcx=src+size (pointer to last full vector)

.loop:
    movdqu   xmm0, [rsi]
    pshufb   xmm0, xmm5
    movdqu   [rdi], xmm0

    movdqu   xmm0, [rsi+12]
    pshufb   xmm0, xmm5
    movdqu   [rdi+16], xmm0

    add      rsi, 24
    add      rdi, 32
    cmp      rsi, rcx       ;; still 9 fused-domain uops in the loop, so it bottlenecks on the frontend.  Need more unroll :/
    jb     .loop

Or use the trick of indexed loads with an index register counting up toward zero, so the add also sets the flags for the loop branch (add rsi, 24 / jl .loop). That saves the cmp uop. (If you're trying to coax a compiler into doing that, or actually writing asm by hand, make sure it's the loads that use the 2-register addressing mode: an indexed addressing mode on the store would stop it from micro-fusing.)


AVX2

There are four options to deal with the lane-crossing (a 32B load would have 4B of data for the high result lane in the low source lane):

  • use 16B load/pshufb/store, same as without AVX. 3 uops, so needs loop unrolling to sustain a 16B store per clock.
  • double-shuffle: 32B load / vpermd to move bytes between lanes / 32B vpshufb / 32B store. Should saturate the shuffle port with no unrolling, sustaining one 32B store per 2 clocks. (It helps that vpermd can work as a load-and-shuffle, saving uops.)
  • inserti128: Two 16B loads / 32B vpshufb / 32B store. Can go even faster, but needs a lot of unrolling (and thus cleanup code).
  • Use loads aligned such that cross-lane data movement isn't needed. (Requires a special-case at the start of a buffer). See BeeOnRope's answer; this is clearly the best way, requiring only one vpshufb ymm, so it obsoletes much of the rest of this answer. We need to do unaligned loads anyway.

  • (AVX512): vpermb is a full cross-lane byte shuffle with 6-bit indices in the control mask (for the 512b version). The bytes to be shuffled can be a memory operand, so it can be used as a load-and-shuffle. (vpshufb can have its control mask in memory, but can't work as a load-and-shuffle. Presumably because it was designed when 32bit still mattered, where only 8 vector regs are available).
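
To illustrate, here is an untested AVX-512VBMI sketch of one 48B->64B step with vpermb (compile with -mavx512vbmi; the function name, the index table, and the 64B load that over-reads past the 48 useful bytes are my assumptions, not code from this answer):

#include <immintrin.h>
#include <stdint.h>

// output byte 4k+j takes input byte 3k + (j>1 ? j-1 : j), i.e. ABC -> ABBC per group
static __m512i expand48to64_vpermb(const char *src) {
  static const uint8_t idx[64] = {
     0, 1, 1, 2,   3, 4, 4, 5,   6, 7, 7, 8,   9,10,10,11,
    12,13,13,14,  15,16,16,17,  18,19,19,20,  21,22,22,23,
    24,25,25,26,  27,28,28,29,  30,31,31,32,  33,34,34,35,
    36,37,37,38,  39,40,40,41,  42,43,43,44,  45,46,46,47 };
  __m512i shuf = _mm512_loadu_si512(idx);
  __m512i in   = _mm512_loadu_si512(src);     // reads 64B but only the low 48B matter: the caller must allow that over-read
  return _mm512_permutexvar_epi8(shuf, in);   // vpermb: full cross-lane byte shuffle, can fold the load
}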

SnB/IvB can do 128b shuffles with one per 0.5c throughput, but since they only have 16B data paths to L1 cache, you might as well just have them (and AMD Bulldozer-family) max out their store throughput with the non-AVX version. They support AVX1 but not AVX2. Don't bother making an AVX1 version; there's nothing to be gained over an SSSE3 version, except maybe avoiding a vzeroupper somewhere. (You could combine two 128b shuffle results with vinsertf128 before storing, which could possibly be a tiny advantage.)


Haswell/Skylake cores only have one shuffle port, so the double-shuffle version that needs two shuffles per 32B of results will bottleneck on that. However, the total fused-domain uop throughput required for that is much lower than for the 16B version, so you don't need to unroll at all to max out the throughput. Still, if you're going to make an unrolled SSSE3 version, you might as well just use that instead of also making an AVX2 version this way. If you're not planning on a non-AVX version, or want to keep it simple, this should give good performance with the least complex source code. Especially if your output buffer is (usually) 32B-aligned.

double-shuffle is also more hyperthreading friendly, because it runs fewer uops. It might still benefit from a small unroll to reduce loop overhead in this case, so it could still saturate the shuffle port when it's only getting half the frontend / issue cycles. It also increases the out-of-order window: ~the same number of in-flight loads and stores are accessing twice as much memory. This might help reduce pipeline bubbles from cache misses, but probably makes almost no difference for sequential access like this. Cache-line-crossing 32B loads/stores might be worse than 16B ones. (Aligning the output buffer is a really good idea, and make sure the input buffer is at least 4B-aligned.)


The vinserti128 version:

The trick is that vinserti128 with a memory source doesn't need the shuffle port: any ALU port will do. So in theory we can do two overlapping 16B loads and one 32B store per cycle. Haswell/Skylake can't sustain that in practice, because some stores will run their AGU uop on port 2 or 3 instead of the port 7 dedicated store AGU. Intel's optimization manual (in Section 2.1.3, see the x86 tag wiki for links) gives a table of peak vs. sustained throughput for L1, L2, etc. on Skylake. Skylake can only sustain ~81B/cycle total to/from L1D cache, vs. the peak of 96B per clock (2 loads and one store). I think the reason is stores stealing AGU/execution ports from loads, so that will affect us even if our loads are only 16B.

Another major problem: The 4 fused-domain uops per-clock pipeline width: vinserti128 is 2 fused-domain uops, so vmovdqu(16B load) / vinserti128 y,y,m,i / vpshufb / vmovdqu(32B store) is already 5 uops without considering loop overhead. So even with a large unroll, the best we could do is keep the shuffle and load/store ports 4/5ths occupied. That's just slightly below the 81B per clock bottleneck, so that probably won't come into play after all. Still, nearly 32B * 4 / 5c is a solid win over 16B / c.

Don't unroll too much, since we need the frontend to supply 4 uops per clock. The loop buffer will help avoid a bottleneck there if the loop is under 28 uops or so. (Or larger with hyperthreading disabled, and Skylake may have increased it.)

gcc and clang can't unroll the loop even with -funroll-loops, presumably because the number of iterations isn't known at compile time. -funroll-all-loops barely reduces the overhead at all, just putting multiple increments and loop-exit branches in the loop body. So you need to manually unroll the loop for the vinserti128 version to have any advantage.


The code:

Insert and double-shuffle versions, with no unrolling. Neither has been tested or debugged, but the asm looks good.

You'll want to tidy these up and polish the cleanup code to match your requirements. Probably also benchmark the two versions (or three if you write a non-AVX version).

See the code and asm on the godbolt compiler explorer:

#include <immintrin.h>
#include <assert.h>

// This version won't have much advantage over a 16B loop,
// without significant loop unrolling in the source and expanding the cleanup code to match
void expand12to16_insert128(char *restrict dst, const char *restrict src, size_t src_bytes) {
  // setr: args in little-endian order
  const __m256i byteshuf = _mm256_setr_epi8(0,1,1,2, 3,4,4,5, 6,7,7,8, 9,10,10,11,
                                            0,1,1,2, 3,4,4,5, 6,7,7,8, 9,10,10,11);  
  //const __m256i byteshuf = _mm256_broadcastsi128_si256(byteshuf128);  // gcc is dumb and makes bad code for this, but it does save space


  assert(src_bytes >= 28);  // 28 because the cleanup code reads 4B before the last 24B and then shifts.  That can potentially segfault
    // FIXME: handle this case if needed.
    // maybe with a load that avoids crossing any cache-line boundaries not crossed by the input,
    // and then a VPMASKMOVD conditional store

  const char *lastsrcvec = src + src_bytes - 24;
  for ( ; src < lastsrcvec ; dst += 32, src += 24 ){
#if 1
    __m256i in    = _mm256_castsi128_si256( _mm_loadu_si128((__m128i*)src) );
    __m128i in_hi = _mm_loadu_si128((__m128i*)(src+12) );
    in = _mm256_inserti128_si256(in, in_hi, 1);
#else
    __m128i in_lo = _mm_loadu_si128((__m128i*)(src+0));
    __m128i in_hi = _mm_loadu_si128((__m128i*)(src+12) );
    __m256i in    = _mm256_set_m128i(in_hi, in_lo);  // clang supports this, but gcc doesn't.  Same asm, nicer C syntax
#endif
    __m256i out   = _mm256_shuffle_epi8(in, byteshuf);
    _mm256_storeu_si256((__m256i*)dst, out);
  }

  // grab the last 24B with loads that don't go past the end of the array (i.e. offset by -4)
  // Instead of using a 2nd shuffle mask to shuffle from these offset positions,
  // byte-shift each lane back down to the bottom of the 16B
  // Note that the shift count is a compile time constant: it's the amount of overlap that can vary

  // movq / pinsrd could be useful as a 12B load
  __m256i in    = _mm256_castsi128_si256( _mm_loadu_si128((__m128i*)(lastsrcvec-4)) );
  __m128i in_hi = _mm_loadu_si128((__m128i*)(lastsrcvec-4 + 12) );
  // byte shifting just the hi lane would mean the low lane wouldn't have to be offset
  // but then there'd have to be a load separate from the inserti128
  in = _mm256_inserti128_si256(in, in_hi, 1);  // [ ABC DEF GHI JKL XXXX | same ]

  in = _mm256_bsrli_epi128(in, 4);             // [ 0000 ABC DEF GHI JKL | same ]
  __m256i out   = _mm256_shuffle_epi8(in, byteshuf);

  dst -= (src - lastsrcvec) * 16 / 12;  // calculate the overlap
  // If the caller already needs to calculate dst_bytes, pass that instead of src_bytes
  // Because *3/4 is cheaper than *4/3
  _mm256_storeu_si256((__m256i*)dst, out);

  //return byteshuf;
}


// clang-3.8 miscompiles this to shuffle one shuffle mask with the other, and copy that constant to the whole dst
void expand12to16_doubleshuffle(char *restrict dst, const char *restrict src, size_t src_bytes) {

  assert(src_bytes >= 24);

  // setr: args in little-endian order
  const __m128i byteshuf128 = _mm_setr_epi8(0,1,1,2, 3,4,4,5, 6,7,7,8, 9,10,10,11);
                                            //0,1,1,2, 3,4,4,5, 6,7,7,8, 9,10,10,11);
  const __m256i byteshuf = _mm256_broadcastsi128_si256(byteshuf128);  // gcc is dumb and uses a 128b load then vinserti128, instead of a vbroadcasti128 load :/

  // const __m256i lane_adjust_shuf = _mm256_setr_epi32(0,1,2,2, 3,4,5,5);
  // save some space by using a 8->32 pmovzx load.
  const __m256i lane_adjust_shuf = _mm256_cvtepu8_epi32(_mm_setr_epi8(0,1,2,2, 3,4,5,5,
               /* unused padding that isn't optimized away :( */      0,0,0,0, 0,0,0,0));

  const char *lastsrcvec = src + src_bytes - 24;
  for ( ; src < lastsrcvec ; dst += 32, src += 24 ){
    __m256i in    = _mm256_loadu_si256((__m256i*)(src+0));
            in    = _mm256_permutevar8x32_epi32(in, lane_adjust_shuf);
    __m256i out   = _mm256_shuffle_epi8(in, byteshuf);
    _mm256_storeu_si256((__m256i*)dst, out);
  }

  // Use the insert cleanup code because it's easier to load just the last 24B we want
  // slightly modified from the insert128 version to only load the last 24, not 28B
  __m256i in    = _mm256_castsi128_si256( _mm_loadu_si128((__m128i*)(lastsrcvec)) );
  __m128i in_hi = _mm_loadu_si128((__m128i*)(lastsrcvec-4 + 12) );
   // byte shift pshufd instead of bsrli, so the load can fold into it
                                                  // before:    [ LKJ IHG FED CBA XXXX ]
  in_hi = _mm_shuffle_epi32(in_hi, _MM_SHUFFLE(3,3,2,1));    // [ LKJI LKJ IHG FED CBA ]
  in = _mm256_inserti128_si256(in, in_hi, 1);

  __m256i out   = _mm256_shuffle_epi8(in, byteshuf);

  // see the full comments in the other version
  dst -= (src - lastsrcvec) * 16 / 12;  // calculate the overlap
  _mm256_storeu_si256((__m256i*)dst, out);

  //return byteshuf;
}

(clang bug report filed for the mis-compiled shuffles)

The inner loops, from gcc 5.3 -O3 -march=haswell -masm=intel:

#insert version
.L4:
        vmovdqu xmm0, XMMWORD PTR [rsi]   #,* src
        add     rsi, 24   # src,
        vinserti128     ymm0, ymm0, XMMWORD PTR [rsi-12], 0x1     # tmp128, tmp124,
        add     rdi, 32   # dst,
        vpshufb ymm0, ymm0, ymm1  # tmp131, tmp128, tmp157
        vmovdqu YMMWORD PTR [rdi-32], ymm0        #, tmp131
        cmp     rax, rsi  # lastsrcvec, src
        ja      .L4 #,

7 fused-domain uops, should run one iteration per 2 clocks. (i.e. store 16B per cycle). With unrolling can go faster.

#double-shuffle version
.L16:
        vpermd  ymm0, ymm2, YMMWORD PTR [rsi]       # tmp126, D.27207,* src
        add     rsi, 24   # src,
        vpshufb ymm0, ymm0, ymm1  # tmp127, tmp126, D.27202
        add     rdi, 32   # dst,
        vmovdqu YMMWORD PTR [rdi-32], ymm0        #, tmp127
        cmp     rax, rsi  # lastsrcvec, src
        ja      .L16        #,

6 fused-domain uops, should also run one iteration per 2 clocks. This is as fast as it will ever get, though, because of the shuffle port bottleneck. If you're not going to unroll, I'd test both, but I suspect this one will do well.



Answer 2:

Following on from Peter's solution, for AVX2 it seems like you can get to 32B/cycle (output bytes) by offsetting a 32B load so the 16B boundary falls in the right place, between two groups of 12 bytes:

For example:

byte: 0123456789012345|0123456789012345
load: xxxxAAABBBCCCDDD|EEEFFFGGGHHHxxxx 
pshuf AAAABBBBCCCCDDDD|EEEEFFFFGGGGHHHH

Now no lane-crossing movement is needed, so with the same unrolling as the original SSSE3 solution I think you get to 32 bytes per cycle pretty easily, unless cache-line-crossing misaligned accesses hurt you too much.
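
A minimal intrinsics sketch of that inner loop (untested; the function name is mine, and it assumes the first 24B group and the tail are handled separately so every 32B load at src-4 stays inside the source buffer):

#include <immintrin.h>

// low lane of each load: 4 junk bytes + AAABBBCCCDDD; high lane: EEEFFFGGGHHH + 4 junk bytes
void expand12to16_offsetload(char *restrict dst, const char *restrict src, const char *src_end) {
  const __m256i byteshuf = _mm256_setr_epi8( 4, 5, 5, 6,  7, 8, 8, 9, 10,11,11,12, 13,14,14,15,
                                             0, 1, 1, 2,  3, 4, 4, 5,  6, 7, 7, 8,  9,10,10,11);
  for ( ; src < src_end ; src += 24, dst += 32) {
    __m256i in  = _mm256_loadu_si256((const __m256i*)(src - 4)); // the 16B lane boundary lands between the two 12B halves
    __m256i out = _mm256_shuffle_epi8(in, byteshuf);             // one in-lane vpshufb, no lane crossing
    _mm256_storeu_si256((__m256i*)dst, out);
  }
}

The first group can be produced with the two-load insert version (or the buffer padded at the start), and the tail handled with the overlapping-store trick from the other answer.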