Let's say you have values in rax and rdx that you want to load into an xmm register.

One way would be:

    movq   xmm0, rax
    pinsrq xmm0, rdx, 1

It's pretty slow though! Is there a better way?
You're not going to do better for latency or uop count on recent Intel or AMD (I mostly looked at Agner Fog's tables for Ryzen / Skylake).
movq+movq+punpcklqdq is also 3 uops, for the same port(s).

On Intel / AMD, storing the GP registers to a temporary location and reloading them with a 16-byte read may be worth considering for throughput if surrounding code bottlenecks on the ALU port for integer->vector, which is port 5 for recent Intel.
On Intel, pinsrq x,r,imm is 2 uops for port 5, and movq xmm, r64 is also 1 uop for port 5. movhps xmm, [mem] can micro-fuse the load, but it still needs a port 5 ALU uop. So movq xmm0, rax / mov [rsp-8], rdx / movhps xmm0, [rsp-8] is 3 fused-domain uops, 2 of them needing port 5 on recent Intel. The store-forwarding latency makes this significantly higher latency than an insert.

Storing both GP regs with store / store / movdqa (long store-forwarding stall from reading the two narrower stores with a larger load) is also 3 uops, but is the only reasonable sequence that avoids any port 5 uops. The ~15 cycles of latency is so much that out-of-order execution could easily have trouble hiding it.

For YMM and/or narrower elements, stores + reload is more worth considering because you amortize the stall over more stores / it saves you more shuffle uops. But it still shouldn't be your go-to strategy for 32-bit elements.
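Collected as code, the alternatives above look something like this (Intel syntax; a sketch, not a drop-in: the scratch slots below rsp assume the x86-64 SysV red zone is usable, and the movdqa reload assumes that slot happens to be 16-byte aligned):

    ; Option 1: ALU-only, 3 uops, all port 5 on recent Intel
    movq       xmm0, rax
    movq       xmm1, rdx
    punpcklqdq xmm0, xmm1        ; low qword = rax, high qword = rdx

    ; Option 2: one store + movhps reload, 3 fused-domain uops, 2 need port 5
    movq       xmm0, rax
    mov        [rsp-8], rdx
    movhps     xmm0, [rsp-8]     ; store-forwarding latency on the high half

    ; Option 3: store both + vector reload, 3 uops, none need port 5,
    ; but a ~15c store-forwarding stall
    mov        [rsp-16], rax
    mov        [rsp-8], rdx
    movdqa     xmm0, [rsp-16]    ; use movdqu if alignment isn't guaranteed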
For narrower elements, it would be nice if there were a single-uop way of packing 2 narrow integers into a 64-bit integer register, to set up for wider transfers to XMM regs. But there isn't: see Packing two DWORDs into a QWORD to save store bandwidth. shld is 1 uop on Intel SnB-family but needs one of the inputs at the top of a register. x86 has pretty weak bitfield insert/extract instructions compared to PowerPC or ARM, requiring multiple instructions per merge (other than store/reload, and store throughput of 1 per clock can easily become a bottleneck).

AVX512F can broadcast to a vector from an integer reg, and merge-masking allows single-uop inserts.
According to the spreadsheet from http://instlatx64.atw.hu/ (taking uop data from IACA), it only costs 1 port-5 uop to broadcast any width of integer register to an x/y/zmm vector on Skylake-AVX512. Agner doesn't seem to have tested integer source regs on KNL, but a similar VPBROADCASTMB2Q v, k (mask register source) is 1 uop.

With a mask register already set up, it's only 2 uops total.
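A sketch of that 2-uop sequence, assuming k1 = 0b0010 is already set up (mask setup is shown further down):

    vmovq        xmm0, rax        ; element 0 = rax; unmasked, so it can use a shorter VEX encoding
    vpbroadcastq xmm0{k1}, rdx    ; merge-masked: only element 1 is written (= rdx)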
I think merge-masking is "free" even for ALU uops. Note that we do the VMOVQ first so we can avoid a longer EVEX encoding for it. But if you have 0001 in a mask reg instead of 0010, blend it into an unmasked broadcast with vmovq xmm0{k1}, rax.

With more mask registers set up, we can do 1 reg per uop.
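For example, filling a YMM from four integer registers; rdi and rsi below are hypothetical stand-ins for wherever the other two values live, and the masks are as set up further down:

    vmovq        xmm0, rax        ; element 0
    vpbroadcastq xmm0{k1}, rdx    ; element 1, k1 = 0b0010
    vpbroadcastq ymm0{k2}, rdi    ; element 2, k2 = 0b0100 (rdi: hypothetical source)
    vpbroadcastq ymm0{k3}, rsi    ; element 3, k3 = 0b1000 (rsi: hypothetical source)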
(For a full ZMM vector, maybe start a 2nd dep chain and vinserti64x4 to combine 256-bit halves. That also means only 3 k registers instead of 7. It costs 1 extra shuffle uop, but unless there's some software pipelining, OoO exec might have trouble hiding the latency of 7 merges = 21c before you do anything with your vector.)
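A sketch of that two-chain ZMM version; r8-r13 are hypothetical stand-ins for the remaining six source registers, and only k1-k3 are needed:

    ; chain 1: low 256 bits
    vmovq        xmm0, rax
    vpbroadcastq xmm0{k1}, rdx    ; k1 = 0b0010
    vpbroadcastq ymm0{k2}, r8     ; k2 = 0b0100
    vpbroadcastq ymm0{k3}, r9     ; k3 = 0b1000

    ; chain 2: high 256 bits, built independently so the merges can overlap
    vmovq        xmm1, r10
    vpbroadcastq xmm1{k1}, r11
    vpbroadcastq ymm1{k2}, r12
    vpbroadcastq ymm1{k3}, r13

    ; combine the halves: 1 extra shuffle uop
    vinserti64x4 zmm0, zmm0, ymm1, 1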
Intel's listed latency for vpbroadcastq on SKX is still 3c even when the destination is only xmm, according to the Instlatx64 spreadsheet (http://instlatx64.atw.hu/), which quotes that and other sources. The same document does list vpbroadcastq xmm, xmm as 1c latency, so presumably it's correct that we get 3c latency per step in the merging dependency chain. Merge-masking uops unfortunately need the destination register to be ready as early as the other inputs, so the merging part of the operation can't forward separately.
Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT, as sketched below. KSHIFT runs only on port 5 (SKX), but so does KMOV; moving each mask from an integer register would just cost extra instructions to set up integer regs first.
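A minimal sketch of that setup:

    mov      eax, 0b0010          ; p0156
    kmovw    k1, eax              ; p5: k1 = 0b0010
    kshiftlw k2, k1, 1            ; p5: k2 = 0b0100
    kshiftlw k3, k1, 2            ; p5: k3 = 0b1000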
It's actually ok if the upper bytes of the vector are filled with broadcasts, not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.
We eventually write all the elements. We could start with KXNOR k0, k0, k0 to generate a -1 and left-shift that, but that's 2 port-5 uops vs. mov eax, 2 / kmovw k1, eax being p0156 + p5.
Without a mask register, there's a 3-uop sequence, shown below. (There's no kmov k1, imm, and loading from memory costs multiple uops, so as a one-off there's no 3-uop option using merge-masking. But in a loop, if you can spare some mask regs, that appears to be far better.) The only benefit of the no-mask sequence is that one of the 3 uops doesn't need port 5.
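A sketch of one such sequence, using the AVX512F integer-source broadcast from above to get rdx into the upper lane; the exact choice of broadcast + blend and the blend immediate are my reconstruction, so verify before relying on it:

    vmovq        xmm0, rax                   ; 1 uop, port 5
    vpbroadcastq xmm1, rdx                   ; 1 uop, port 5: rdx in both qwords
    vpblendd     xmm0, xmm0, xmm1, 0b1100    ; 1 uop, any vector ALU port:
                                             ; dwords 2-3 (high qword) from xmm1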
vmovsd xmm1, xmm1, xmm0 would also blend the two halves, but it only runs on port 5 on recent Intel, unlike an integer immediate blend which runs on any vector ALU port.

More discussion about integer -> vector strategies:
gcc likes to store/reload, which is not optimal on anything except in very rare port 5-bound situations where a large amount of latency doesn't matter. I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833, with more discussion of what might be optimal on various architectures for 32-bit or 64-bit elements.
I suggested the above vpbroadcastq replacement for the insert with AVX512 on the first bug.

(If compiling _mm_set_epi64x, definitely use -mtune=haswell or something recent to avoid the crappy tuning for the default -mtune=generic. Or use -march=native if your binaries will only run on the local machine.)