Consider the following loop in x86:
; on entry, rdi has the number of iterations
.top:
; some magic happens here to calculate a result in rax
mov [array + rdi * 8], rax ; store result in output array
dec rdi
jnz .top
It's straightforward: something calculates a result in rax (not shown) and then we store the result into the array, in reverse order since we index with rdi.
I would like to transform the above loop so that it doesn't make any writes to memory (we can assume the not-shown calculation doesn't write to memory).
As long as the loop count in rdi is limited, I could use the ample space (512 bytes) provided by the ymm regs to save the values instead, but it seems awkward to actually do this, since you can't "index" an arbitrary register.
One approach would be to always shuffle the entire "array" of ymm registers by one element and then insert the new element in the newly freed-up position. Something like this:
vpermq ymm3, ymm3, 10_01_00_11b ; left rotate ymm by qword
vpermq ymm2, ymm2, 10_01_00_11b ; left rotate ymm by qword
vpermq ymm1, ymm1, 10_01_00_11b ; left rotate ymm by qword
vpermq ymm0, ymm0, 10_01_00_11b ; left rotate ymm by qword
vpblendd ymm3, ymm3, ymm2, 3 ; promote one qword of ymm2 to ymm3
vpblendd ymm2, ymm2, ymm1, 3 ; promote one qword of ymm1 to ymm2
vpblendd ymm1, ymm1, ymm0, 3 ; promote one qword of ymm0 to ymm1
pinsrq xmm0, rax, 0 ; playing with mixed-VEX mode fire (see Peter's answer)
This shows the handling of only four of the 16 registers, so evidently doing all 16 would take a lot of code (32 instructions).
Is there a better way?
Unpredictable branches are undesirable, but we can still consider solutions that use them.
A few options you could consider:
Unroll
If you unroll your loop (which necessarily has a limited number of iterations, since the storage available in ymm registers is limited to 64 qwords), you have the chance to use hard-coded logic to insert your result from rax directly in the right place, e.g., with pinsrq or movq, sometimes combined with a shuffle to let you access the high lanes. It will probably take only about 1.25 instructions per iteration, much better than 32!
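A sketch of how the first four results might be hard-coded into ymm0 (using xmm15 as a scratch register is an assumption; later groups repeat the same pattern for ymm1, ymm2, ...):
; iteration 0: some magic happens here to calculate a result in rax
vmovq       xmm0, rax
; iteration 1: some magic happens here to calculate a result in rax
vpinsrq     xmm0, xmm0, rax, 1
; iteration 2: some magic happens here to calculate a result in rax
vmovq       xmm15, rax
; iteration 3: some magic happens here to calculate a result in rax
vpinsrq     xmm15, xmm15, rax, 1
vinserti128 ymm0, ymm0, xmm15, 1   ; move the second pair into the high lane of ymm0
; 5 instructions per 4 results = 1.25 instructions per iteration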
Vertical Lanes
Your current shuffling solution could be characterized as a horizontal rotation through a register, carrying from the high qword of ymm N into the low qword of ymm N+1. That is, adjacent elements within one register are logically adjacent in your scheme. Instead, you could do a vertical rotation, where elements in a given qword lane are logically adjacent to the elements in the same lane in registers ymm N-1 and ymm N+1. This avoids the need for any horizontal shuffling, and most of the shifting needs only a single register-register mov. You only need special handling for the first and last registers to wrap your elements around into the next lane.
Something like this:
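(A sketch for a four-register queue, with ymm15 as the temporary; a longer queue just adds one vmovdqa per extra register.)
vpermq  ymm15, ymm3, 10_01_00_11b ; wraparound: rotate the outgoing register's lanes up by one qword
                                  ; (lane 3 wraps to lane 0, where the new element will overwrite it)
vmovdqa ymm3, ymm2                ; vertical shift: each register's lanes move up one position
vmovdqa ymm2, ymm1
vmovdqa ymm1, ymm0
vmovdqa ymm0, ymm15
pinsrq  xmm0, rax, 0              ; insert the new element (same mixed-VEX caveat as above)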
That's about as simple as you are going to get for a general "shift every element" strategy: a single vmovdqa per used ymm register, plus two additional instructions to do the wraparound and new element insertion. As far as vector operations go, register-register moves are pretty much faster than any other type of operation, since they can be move-eliminated (0 latency) and can execute 4 per cycle.
This approach does need a temporary register (ymm15 in the example above) and I can't think of an easy way to eliminate that, so you'd be able to use at most 15 registers as part of your "queue".
Indirect Jump
You could do a calculated indirect jump, based on the iteration count, to a short (2-4 instruction) sequence that puts the element into the right place: basically a vpinsrq and, in some cases, some additional shuffling to access the high lane.
This type of table could be fully general, i.e., allow writes to arbitrary indexes in any order, but if you knew you were indexing sequentially as above, you could simplify the table using that assumption (i.e., you can handle the high lane by writing into the low elements first and then using a vinserti128 or something like that to move them both into the high lane at the right time).
This jump will presumably mispredict the first time through. After that, it may or may not, depending on the pattern and the strength of the indirect branch predictor.
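A sketch of the dispatch (the labels, the table, and ymm15 as scratch are assumptions; the stubs use vmovq + vpblendd rather than vpinsrq, since vpinsrq can only write an xmm destination, as discussed further below):
    ; assume the slot index (0, 1, ... within ymm0 here) has already been computed into rdi
    lea   r10, [rel insert_table]
    jmp   [r10 + rdi*8]                  ; computed dispatch; mispredicts at least the first time

insert_q0:                               ; slot 0: qword 0 of ymm0
    vmovq        xmm15, rax
    vpblendd     ymm0, ymm0, ymm15, 0x03
    jmp          done_insert
insert_q1:                               ; slot 1: qword 1 of ymm0 (broadcast so the blend can reach it)
    vmovq        xmm15, rax
    vpbroadcastq ymm15, xmm15
    vpblendd     ymm0, ymm0, ymm15, 0x0c
    jmp          done_insert
    ; ... one short stub per slot; high-lane slots and ymm1..ymm14 follow the same pattern
done_insert:

section .rodata
insert_table: dq insert_q0, insert_q1    ; one entry per slot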
You can't vpinsrq into a YMM register: only an xmm destination is available, so it unavoidably zeros the upper lane of the full YMM register. It was introduced with AVX1 as the VEX version of the 128-bit instruction. AVX2 and AVX512 did not upgrade it to YMM/ZMM destinations. I'm guessing they didn't want to provide inserts into high lanes, and it would have been odd to provide a YMM version that still only looked at the lowest bit of the imm8.
You're going to need a scratch register and then blend into a YMM with vpblendd. Or (on Skylake or AMD) use the legacy-SSE version to leave the upper bytes unchanged! On Skylake, writing an XMM reg with a legacy-SSE instruction has a false dependency on the full register, and here you actually want that false dependency. (I haven't tested this; it might trigger a merging uop of some sort.) But you don't want this on Haswell, where it saves the upper halves of all the YMM regs, going into "state C".
The obvious solution is to leave yourself a scratch reg to use for vmovq + vpblendd (instead of vpinsrq y,r,0). That's still 2 uops, but vpblendd doesn't need port 5 on Intel CPUs, in case that matters (movq does use port 5). If you're really hard up for space, the mm0..7 MMX registers are available.
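A minimal sketch of the movq + blend idea (the choice of ymm15 as scratch and qword 0 of ymm0 as the destination is arbitrary):
vmovq    xmm15, rax               ; stage the new qword in the scratch register (port 5)
vpblendd ymm0, ymm0, ymm15, 0x03  ; merge it into qword 0 of ymm0; the blend isn't tied to port 5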
Reducing the cost
With nested loops, we can split the work. With a small amount of unrolling of the inner loop, we can mostly remove that part of the cost.
For example, if we have an inner loop that produces 4 results, we can use your brute-force stack approach on 2 or 4 registers in the inner loop, giving moderate overhead with no actual unrolling (the "magic" payload appears only once): 3 or 4 uops per result, optionally with no loop-carried dep chain (a sketch follows below).
Bonus: this only requires AVX1 (and is cheaper on AMD, keeping 256-bit vectors out of the inner loop). We still get 12 x 4 qwords of storage instead of 16 x 4, which was an arbitrary number anyway.
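One way the non-unrolled inner loop might look, using the question's shift-and-insert idea on just two XMM registers (register choice and the rcx trip counter are assumptions; only AVX1 instructions are used):
.inner:
    ; some magic happens here to calculate a result in rax
    vshufpd xmm1, xmm0, xmm1, 01b   ; new xmm1 low = xmm0's high, new xmm1 high = xmm1's old low
    vmovq   xmm2, rax               ; stage the new result
    vshufpd xmm0, xmm2, xmm0, 00b   ; new xmm0 low = new result, new xmm0 high = xmm0's old low
    dec     rcx
    jnz     .inner
    ; after 4 iterations xmm0 and xmm1 hold the last 4 results;
    ; the outer loop can vinsertf128 them into the next register of the big queue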
Limited unrolling
We can unroll just the inner loop, like this:
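Something along these lines (register numbering, queue depth, and the loop bookkeeping are illustrative assumptions):
.outer:
    ; shift the queue of already-packed groups up by one register
    ; (one vmovdqa per queue register; with mov-elimination these need no execution port)
    vmovdqa     ymm15, ymm14
    vmovdqa     ymm14, ymm13
    ; ... and so on down the queue ...
    vmovdqa     ymm3, ymm2
    vinserti128 ymm2, ymm1, xmm0, 1   ; pack the previous group of 4 qwords into ymm2

    ; some magic happens here to calculate a result in rax
    vmovq       xmm0, rax
    ; some magic happens here to calculate a result in rax
    vpinsrq     xmm0, xmm0, rax, 1
    ; some magic happens here to calculate a result in rax
    vmovq       xmm1, rax
    ; some magic happens here to calculate a result in rax
    vpinsrq     xmm1, xmm1, rax, 1

    sub         rdi, 4
    jnz         .outer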
When we leave the loop, ymm15..2 and xmm1 and xmm0 are full of valuable data. If the vmovdqa/vinserti128 block were at the bottom of the loop instead, it would run the same number of times, but ymm2 would just be a copy of xmm0 and xmm1. A jmp to enter the loop without doing the vmovdqa stuff on the first iteration is an option.
Per 4x magic, this costs us 6 uops for port 5 (movq + pinsrq), 12x vmovdqa (no execution unit needed), and 1x vinserti128 (port 5 again). So that's 19 uops per 4x magic, or 4.75 uops per result.
You can interleave the vmovdqa + vinsert with the first magic, or just split it before / after the first magic. You can't clobber xmm0 until after the vinserti128, but if you have a spare integer reg you can delay the vmovq.
More nesting
Another loop nesting level, or more unrolling, would greatly reduce the number of vmovdqa instructions. Just getting the data shuffled into YMM regs at all has a minimum cost, though: loading an xmm from GP regs. AVX512 can give us cheaper int->xmm (and it would allow writing to all 4 elements of a YMM). But I don't see it avoiding the need to unroll or nest loops to avoid touching all the registers every time.
PS:
My first idea for the shuffle accumulator was shuffling elements one to the left. But then I realized this ended up with 5 elements of state, not 4, because we had a high and a low element in two regs, plus the newly-written xmm0. (And it could have used vpalignr.)
Leaving this here as an example of what you can do with vshufpd: move the low element to the high position in one register, and merge in the high element from another register as the new low.
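(A sketch of that step; using xmm0/xmm1 as the two accumulator registers is an assumption.)
vshufpd xmm1, xmm0, xmm1, 01b   ; new low = xmm0's high qword, new high = xmm1's old low qword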
AVX512: indexing vectors as memory
For the general case of writing to vector regs as memory, we can vpbroadcastq zmm0{k1}, rax and repeat for other zmm registers with a different k1 mask. Broadcasts with merge-masking (where the mask has a single bit set) give us an indexed store into vector registers, but we need one instruction for every possible destination register.
Creating the mask:
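One way to build the single-bit mask (a sketch; assumes the element index within the zmm is in ecx and that BMI2 shlx is available):
mov          eax, 1
shlx         eax, eax, ecx     ; 1 << index (each zmm holds 8 qwords, so index is 0..7)
kmovw        k1, eax           ; kmovw only needs AVX512F
vpbroadcastq zmm0{k1}, rax     ; merge-masking: only the selected qword is written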
To read from a ZMM register:
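(A sketch; k1 is assumed to hold a single set bit selecting the element, as above.)
vpcompressq zmm1{k1}{z}, zmm0   ; zero-masking: the selected qword is compacted down to element 0
vmovq       rax, xmm1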
(See the docs for vpcompressq: with zero-masking, it zeros all elements above the one it writes.)
To hide vpcompressq latency, you can do multiple dep chains into multiple tmp vectors, then vpor xmm0, xmm0, xmm1 at the end. (One of the vectors will be all zero; the other will have the selected element.)
On SKX, vpcompressq has 3c latency and 2c throughput, according to an instlatx64 report.
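For instance, if the element might live in either zmm4 or zmm5, something like this (the register and mask choices are hypothetical; only one of k1/k2 has a bit set):
vpcompressq zmm1{k1}{z}, zmm4   ; chain 1: selects within zmm4, or produces all-zero
vpcompressq zmm2{k2}{z}, zmm5   ; chain 2: selects within zmm5, or produces all-zero
vpor        xmm0, xmm1, xmm2    ; the all-zero chain drops out of the OR
vmovq       rax, xmm0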