I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below:
ABC DEF GHI JKL
becomes
ABBC DEEF GHHI JKKL
Can you suggest efficient ways to achieve this using SSE(2) and/or AVX(2) instructions?
This needs to be performed repeatedly, so pre-stored masks or constants are allowed.
By far your best bet is to use a byte shuffle (`pshufb`). Shifting within elements isn't enough by itself, since `JKL` has to move farther to the right than `DEF`, etc., so you'd need multiple instructions to do different shifts and blend the results.

`pshufb` (`_mm_shuffle_epi8`) requires SSSE3, but can do the 12B->16B job in a single fast instruction. It uses a vector as a shuffle control mask. It's the first variable-control shuffle, as well as the first flexible byte shuffle. (SSE2 shuffles all use imm8 control operands, or have fixed data movement, e.g. `punpcklbw`.)

It should be pretty easy to write a loop that loads 16B, shuffles the first 12B into 16B, then stores. Use unaligned loads, and if necessary unaligned stores. Instead of a scalar cleanup loop to handle the last few bytes, load the last 16B of input into a vector and shuffle the last 12B of that. Storing that will overlap with the last store in the loop if the array wasn't a multiple of 12B, but that's fine.
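In C intrinsics, that loop looks something like the following (an untested sketch; to keep the tail simple it assumes the input length is a multiple of 12 and at least 24 bytes — other multiples of 3 need a little more tail handling):

```c
#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Expand each 3-byte group ABC into ABBC: 12 input bytes -> 16 output bytes
 * per vector.  Sketch only: assumes len % 12 == 0 and len >= 24. */
static void expand12to16(uint8_t *dst, const uint8_t *src, size_t len)
{
    /* Output byte 4j+k takes input byte 3j + {0,1,1,2}[k]. */
    const __m128i lomask = _mm_setr_epi8(0,1,1,2, 3,4,4,5, 6,7,7,8, 9,10,10,11);
    /* Same pattern, but reading the *last* 12 bytes of a 16B vector. */
    const __m128i himask = _mm_setr_epi8(4,5,5,6, 7,8,8,9, 10,11,11,12, 13,14,14,15);

    size_t i = 0, o = 0;
    /* Each iteration loads 16B but only uses the first 12B, so stop while a
     * full 16B load is still in bounds. */
    while (i + 16 <= len) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + o), _mm_shuffle_epi8(v, lomask));
        i += 12;
        o += 16;
    }
    if (i < len) {
        /* Final group: reload the last 16B of input and shuffle its last 12B
         * into the last 16B of output.  For lengths that aren't a multiple of
         * 12 this store overlaps the previous one, which is fine. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + len - 16));
        _mm_storeu_si128((__m128i *)(dst + (len / 3) * 4 - 16),
                         _mm_shuffle_epi8(v, himask));
    }
}
```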
A 128b loop should be able to sustain 16B of output per clock, if the input and output are hot in L1 cache. It will probably take some unrolling to make that happen, i.e. something like the fragment below.
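For example (an untested sketch reusing `lomask` and the loop variables from the function above; unrolling by 4 is an arbitrary choice):

```c
/* Unrolled by 4: 4 loads + 4 shuffles + 4 stores per iteration, so the
 * pointer updates and loop branch are amortized over 64B of output. */
while (i + 52 <= len) {          /* last load reads src[i+36 .. i+51] */
    __m128i v0 = _mm_loadu_si128((const __m128i *)(src + i));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(src + i + 12));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(src + i + 24));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(src + i + 36));
    _mm_storeu_si128((__m128i *)(dst + o),      _mm_shuffle_epi8(v0, lomask));
    _mm_storeu_si128((__m128i *)(dst + o + 16), _mm_shuffle_epi8(v1, lomask));
    _mm_storeu_si128((__m128i *)(dst + o + 32), _mm_shuffle_epi8(v2, lomask));
    _mm_storeu_si128((__m128i *)(dst + o + 48), _mm_shuffle_epi8(v3, lomask));
    i += 48;
    o += 64;
}
/* ...then fall through to the non-unrolled loop and tail handling above. */
```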
Or use a trick like indexed loads with an index counting up toward zero. That would save a uop (`add rsi, 24` / `jl .loop`). If you're trying to coax a compiler into doing that, or actually writing asm by hand, make sure it's the loads that use a 2-register addressing mode, because an indexed addressing mode would stop the stores from micro-fusing.

AVX2:
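One way to express that shape in C is the following sketch (reusing `lomask`, `src`, `dst`, and `len` from the earlier function; whether the compiler actually produces the intended addressing modes has to be checked in the asm):

```c
/* Give the loads the 2-register addressing mode (end-of-input plus a negative
 * index counting up toward zero), and keep the store address a single
 * register that we bump by 16 so the stores can stay micro-fused. */
const uint8_t *src_end = src + len;
uint8_t *d = dst;
ptrdiff_t idx = -(ptrdiff_t)len;              /* negative, counts up toward 0 */
for (; idx <= -16; idx += 12, d += 16) {      /* keep each 16B load in bounds */
    __m128i v = _mm_loadu_si128((const __m128i *)(src_end + idx));
    _mm_storeu_si128((__m128i *)d, _mm_shuffle_epi8(v, lomask));
}
/* ...then the overlapping tail as before. */
```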
There are four options to deal with the lane-crossing (a 32B load would have 4B of data for the high result lane in the low source lane):

- double-shuffle: do the lane-crossing data movement with a dword shuffle first, then fix up each 128b lane with `vpshufb`. (`vpermd` can work as a load-and-shuffle, saving uops.)
- two overlapping 16B loads per 32B of output: `vinserti128` with a memory source to build the ymm register, then one `vpshufb ymm`. (The `vinserti128` version, discussed below.)
- Use loads aligned such that cross-lane data movement isn't needed. (Requires a special case at the start of a buffer.) See BeeOnRope's answer; this is clearly the best way, requiring only one `vpshufb ymm`, so it obsoletes much of the rest of this answer. We need to do unaligned loads anyway.
- (AVX512): `vpermb` is a full cross-lane byte shuffle with 6-bit indices in the control mask (for the 512b version). The bytes to be shuffled can be a memory operand, so it can be used as a load-and-shuffle. (`vpshufb` can have its control mask in memory, but can't work as a load-and-shuffle; presumably because it was designed when 32-bit still mattered, where only 8 vector regs are available.) A sketch of this option follows the next two paragraphs.

SnB/IvB can do 128b shuffles with one per 0.5c throughput, but since they only have 16B data paths to L1 cache, you might as well just have them (and AMD Bulldozer-family) max out their store throughput with the non-AVX version. They support AVX1 but not AVX2. Don't bother making an AVX1 version; there's nothing to be gained over an SSSE3 version, except maybe avoiding a `vzeroupper` somewhere. (You could combine two 128b shuffle results with `vinsertf128` before storing, which could possibly be a tiny advantage.)

Haswell/Skylake cores only have one shuffle port, so the double-shuffle version that needs two shuffles per 32B of results will bottleneck on that. However, the total fused-domain uop throughput required for it is much lower than for the 16B version, so you don't need to unroll at all to max out the throughput. Still, if you're going to make an unrolled SSSE3 version, you might as well just use that instead of also making an AVX2 version this way. If you're not planning on a non-AVX version, or want to keep it simple, this should give good performance with the least complex source code, especially if your output buffer is (usually) 32B-aligned.

The double-shuffle version is also more hyperthreading-friendly, because it runs fewer uops. It might still benefit from a small unroll to reduce loop overhead in this case, so it can still saturate the shuffle port when it's only getting half the frontend / issue cycles. It also increases the out-of-order window: roughly the same number of in-flight loads and stores are accessing twice as much memory. This might help reduce pipeline bubbles from cache misses, but probably makes almost no difference for sequential access like this. Cache-line-crossing 32B loads/stores might be worse than 16B ones. (Aligning the output buffer is a really good idea, and make sure the input buffer is at least 4B-aligned.)
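For completeness, here is an untested sketch of the AVX512VBMI `vpermb` option from the list above (it assumes a CPU and compiler with AVX512VBMI support, e.g. `-mavx512vbmi`; the function name, index construction, and loop bounds are arbitrary choices):

```c
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   /* AVX512VBMI: _mm512_permutexvar_epi8 (vpermb) */

/* 48 input bytes -> 64 output bytes per vpermb.  Sketch: the remainder (and
 * any length that isn't a multiple of 48) falls back to the 16B loop above. */
static void expand48to64_avx512vbmi(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint8_t idx[64];
    for (int j = 0; j < 16; j++) {       /* output group j <- input bytes 3j.. */
        idx[4*j+0] = (uint8_t)(3*j);
        idx[4*j+1] = (uint8_t)(3*j + 1);
        idx[4*j+2] = (uint8_t)(3*j + 1);
        idx[4*j+3] = (uint8_t)(3*j + 2);
    }
    const __m512i shuf = _mm512_loadu_si512(idx);

    size_t i = 0, o = 0;
    while (i + 64 <= len) {              /* 64B load, only first 48B used */
        __m512i v = _mm512_loadu_si512(src + i);
        _mm512_storeu_si512(dst + o, _mm512_permutexvar_epi8(shuf, v));
        i += 48;
        o += 64;
    }
    /* remaining < 64 input bytes: fall back to the SSSE3 loop / tail above */
}
```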
The `vinserti128` version:

The trick is that `vinserti128` with a memory source doesn't need the shuffle port: any ALU port will do. So in theory we can do two overlapping 16B loads and one 32B store per cycle. Haswell/Skylake can't sustain that in practice because some stores will run their AGU uop on port 2 or 3, instead of the port-7 dedicated store AGU. Intel's optimization manual (Section 2.1.3; see the x86 tag wiki for links) gives a table of peak vs. sustained throughput for L1, L2, etc. on Skylake. Skylake can only sustain ~81B/cycle total to/from L1D cache, vs. the peak of 96B per clock (2 loads and one store). I think the reason is stores stealing execution ports from loads, so that will affect us even if our loads are only 16B.

Another major problem is the 4-fused-domain-uops-per-clock pipeline width: `vinserti128` is 2 fused-domain uops, so `vmovdqu` (16B load) / `vinserti128 y,y,m,i` / `vpshufb` / `vmovdqu` (32B store) is already 5 uops without considering loop overhead. So even with a large unroll, the best we could do is keep the shuffle and load/store ports 4/5ths occupied. That's just slightly below the 81B-per-clock bottleneck, so that probably won't come into play after all. Still, nearly 32B * 4/5 per clock is a solid win over 16B per clock.

Don't unroll too much, since we need the frontend to supply 4 uops per clock. The loop buffer will help avoid a bottleneck there if the loop is under 28 uops or so. (Or larger with hyperthreading disabled; Skylake may have increased it.)
gcc and clang can't unroll the loop even with `-funroll-loops`, presumably because the number of iterations isn't known at compile time. `-funroll-all-loops` barely reduces the overhead at all, just putting multiple increments and loop-exit branches in the loop body. So you need to manually unroll the loop for the `vinserti128` version to have any advantage.

The code:
Insert and double-shuffle versions, with no unrolling. Neither tested nor debugged, but the asm looks good.
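In outline, the two versions look something like this (an untested sketch, not the exact code from the Godbolt link; function names and loop bounds are arbitrary, and tail handling is left to the 16B version):

```c
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   /* AVX2 */

/* Same in-lane 3->4 byte pattern in each 128b lane. */
#define EXPAND_LANE 0,1,1,2, 3,4,4,5, 6,7,7,8, 9,10,10,11

/* Insert version: two overlapping 16B loads build a ymm (vinserti128 with a
 * memory source doesn't need the shuffle port), then one vpshufb ymm. */
static void expand_insert_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
    const __m256i mask = _mm256_setr_epi8(EXPAND_LANE, EXPAND_LANE);
    size_t i = 0, o = 0;
    while (i + 28 <= len) {              /* 2nd load reads src[i+12 .. i+27] */
        __m256i v = _mm256_castsi128_si256(
                        _mm_loadu_si128((const __m128i *)(src + i)));
        v = _mm256_inserti128_si256(v,
                _mm_loadu_si128((const __m128i *)(src + i + 12)), 1);
        _mm256_storeu_si256((__m256i *)(dst + o), _mm256_shuffle_epi8(v, mask));
        i += 24;  o += 32;
    }
    /* handle the remaining < 28 input bytes with the 16B SSSE3 loop / tail */
}

/* Double-shuffle version: a lane-crossing dword shuffle (vpermd) puts each
 * 12-byte group into its own 128b lane, then the same in-lane vpshufb.
 * Two shuffle uops per 32B of output. */
static void expand_doubleshuffle_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
    const __m256i mask = _mm256_setr_epi8(EXPAND_LANE, EXPAND_LANE);
    /* Dword 3 of each result lane is a don't-care; any valid index works. */
    const __m256i permd = _mm256_setr_epi32(0,1,2,2, 3,4,5,5);
    size_t i = 0, o = 0;
    while (i + 32 <= len) {              /* 32B load, only first 24B used */
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        v = _mm256_permutevar8x32_epi32(v, permd);   /* vpermd */
        _mm256_storeu_si256((__m256i *)(dst + o), _mm256_shuffle_epi8(v, mask));
        i += 24;  o += 32;
    }
    /* handle the remaining < 32 input bytes with the 16B SSSE3 loop / tail */
}
```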
You'll want to tidy these up and polish the cleanup code to match your requirements. Probably also benchmark the two versions (or three if you write a non-AVX version).
See the code and asm on the godbolt compiler explorer:
(clang bug report filed for the mis-compiled shuffles)
The inner loops, from gcc 5.3 `-O3 -march=haswell -masm=intel`: the insert version is 7 fused-domain uops and should run one iteration per 2 clocks (i.e. store 16B per cycle); with unrolling it can go faster. The double-shuffle version is 6 fused-domain uops and should also run one iteration per 2 clocks; that's as fast as it will ever get, though, because of the shuffle-port bottleneck. If you're not going to unroll, I'd test both, but I suspect the double-shuffle version will do well.
Following on from Peter's solution, for AVX2 it seems like you can get to 32B/cycle (output bytes) by offsetting a 32B load so the 16B boundary falls in the right place, between two groups of 12 bytes:
For example:
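A sketch of the idea (my own illustration, untested; it assumes the first 12-byte group is handled separately so the steady-state load can safely start 4 bytes before the current group):

```c
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   /* AVX2 */

/* Steady-state loop: load 32B starting 4 bytes *before* the current group, so
 * input bytes [i .. i+11] sit in the low lane (at offsets 4..15) and bytes
 * [i+12 .. i+23] sit in the high lane (at offsets 0..11).  One in-lane
 * vpshufb ymm then produces 32 output bytes; no lane-crossing shuffle. */
static void expand_offset_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
    const __m256i mask = _mm256_setr_epi8(
        /* low lane reads its 12 bytes from offsets 4..15 */
        4,5,5,6, 7,8,8,9, 10,11,11,12, 13,14,14,15,
        /* high lane reads its 12 bytes from offsets 0..11 */
        0,1,1,2, 3,4,4,5, 6,7,7,8, 9,10,10,11);

    /* First 12 input bytes: do them with a 16B SSSE3-style shuffle (not
     * shown), so the main loop never loads before the start of the buffer. */
    size_t i = 12, o = 16;
    while (i + 28 <= len) {              /* load reads src[i-4 .. i+27] */
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i - 4));
        _mm256_storeu_si256((__m256i *)(dst + o), _mm256_shuffle_epi8(v, mask));
        i += 24;  o += 32;
    }
    /* remaining < 28 input bytes: 16B loop / overlapping tail as before */
}
```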
Now no lane-crossing movement is needed, so with the same unrolling as the original SSSE3 solution I think you get to 32 bytes per cycle pretty easily - unless cache-line-crossing misaligned accesses hurt you too much.