Let's say you have values in rax and rdx you want to load into an xmm register.
One way would be:
movq xmm0, rax
pinsrq xmm0, rdx, 1
It's pretty slow though! Is there a better way?
You're not going to do better for latency or uop count on recent Intel or AMD (I mostly looked at Agner Fog's tables for Ryzen / Skylake). movq+movq+punpcklqdq is also 3 uops, for the same port(s).
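For reference, here is the movq + movq + punpcklqdq data flow written with SSE2 intrinsics. It's only a minimal sketch for x86-64: the name pack_unpack is made up for illustration, and the compiler is free to pick a different instruction sequence than the one suggested in the comments.

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */
#include <stdint.h>

/* Hypothetical helper: lo ends up in element 0, hi in element 1. */
__m128i pack_unpack(uint64_t lo, uint64_t hi)
{
    __m128i vlo = _mm_cvtsi64_si128((long long)lo);  /* typically movq xmm, r64 */
    __m128i vhi = _mm_cvtsi64_si128((long long)hi);  /* typically movq xmm, r64 */
    return _mm_unpacklo_epi64(vlo, vhi);             /* typically punpcklqdq */
}
```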
On Intel / AMD, storing the GP registers to a temporary location and reloading them with a 16-byte read may be worth considering for throughput if surrounding code bottlenecks on the ALU port for integer->vector, which is port 5 for recent Intel.
On Intel, pinsrq x,r,imm is 2 uops for port 5, and movq xmm,r64 is 1 uop, also for port 5.
movhps xmm, [mem] can micro-fuse the load, but it still needs a port 5 ALU uop. So movq xmm0,rax / mov [rsp-8], rdx / movhps xmm0, [rsp-8] is 3 fused-domain uops, 2 of them needing port 5 on recent Intel. The store-forwarding latency makes this significantly higher latency than an insert.
Storing both GP regs with store / store / movdqa (long store-forwarding stall from reading the two narrower stores with a larger load) is also 3 uops, but is the only reasonable sequence that avoids any port 5 uops. The ~15 cycles of latency is so much that out-of-order execution could easily have trouble hiding it.
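In C, the store / store / reload idea looks roughly like the sketch below. The helper name pack_store_reload and the scratch array are made up for illustration, and an optimizing compiler may well replace the whole thing with a register-only sequence, so treat it as a picture of the data flow rather than a way to force this code-gen.

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */
#include <stdint.h>

/* Hypothetical helper: two 8-byte stores followed by one 16-byte load. */
__m128i pack_store_reload(uint64_t lo, uint64_t hi)
{
    /* 16-byte aligned scratch space standing in for a stack slot. */
    _Alignas(16) uint64_t tmp[2];
    tmp[0] = lo;                                   /* narrow store #1 */
    tmp[1] = hi;                                   /* narrow store #2 */
    return _mm_load_si128((const __m128i *)tmp);   /* wide reload (movdqa) */
}
```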
For YMM and/or narrower elements, stores + reload is more worth considering because you amortize the stall over more stores / it saves you more shuffle uops. But it still shouldn't be your go-to strategy for 32-bit elements.
For narrower elements, it would be nice if there was a single-uop way of packing 2 narrow integers into a 64-bit integer register, to set up for wider transfers to XMM regs. But there isn't (see "Packing two DWORDs into a QWORD to save store bandwidth"). shld is 1 uop on Intel SnB-family but needs one of the inputs at the top of a register. x86 has pretty weak bitfield insert/extract instructions compared to PowerPC or ARM, requiring multiple instructions per merge (other than store/reload, and store throughput of 1 per clock can easily become a bottleneck).
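To make the "multiple instructions per merge" point concrete, this is the plain shift + OR way of packing two DWORDs into a QWORD in an integer register. The name pack_dwords is made up for illustration; how many instructions it compiles to depends on the compiler and on whether the low input is already zero-extended.

```c
#include <stdint.h>

/* Hypothetical helper: hi goes in bits 63:32, lo in bits 31:0. */
uint64_t pack_dwords(uint32_t lo, uint32_t hi)
{
    /* Widen before shifting so the shift happens in 64 bits. */
    return ((uint64_t)hi << 32) | lo;
}
```

The packed value can then go to an XMM register with a single 64-bit transfer instead of two 32-bit ones.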
According to the spreadsheet from http://instlatx64.atw.hu/ (taking uop data from IACA), it only costs 1 port5 uop to broadcast any width of integer register to an x/y/zmm vector on Skylake-AVX512.
Agner doesn't seem to have tested integer source regs on KNL, but a similar VPBROADCASTMB2Q v,k (mask register source) is 1 uop.
With a mask register already set up: only 2 uops total:
; k1 = 0b0010
vmovq xmm0, rax ; 1 uop p5 ; AVX1
vpbroadcastq xmm0{k1}, rdx ; 1 uop p5 merge-masking
I think merge-masking is "free" even for ALU uops. Note that we do the VMOVQ first so we can avoid a longer EVEX encoding for it. But if you have 0001 in a mask reg instead of 0010, blend it into an unmasked broadcast with vmovq xmm0{k1}, rax.
With more mask registers set up, we can do 1 reg per uop:
vmovq xmm0, rax ; 2c latency
vpbroadcastq xmm0{k1}, rdx ; k1 = 0b0010 3c latency
vpbroadcastq ymm0{k2}, rdi ; k2 = 0b0100 3c latency
vpbroadcastq ymm0{k3}, rsi ; k3 = 0b1000 3c latency
(For a full ZMM vector, maybe start a 2nd dep chain and vinserti64x4 to combine 256-bit halves. Also means only 3 k registers instead of 7. It costs 1 extra shuffle uop, but unless there's some software pipelining, OoO exec might have trouble hiding the latency of 7 merges = 21c before you do anything with your vector.)
; high 256 bits: maybe better to start again with vmovq instead of continuing
vpbroadcastq zmm0{k4}, rcx ; k4 = 0b10000 3c latency
; ... filling up the ZMM reg
Intel's listed latency for vpbroadcastq on SKX is still 3c even when the destination is only xmm, according to the Instlatx64 spreadsheet (http://instlatx64.atw.hu/), which quotes that and other sources. The same document does list vpbroadcastq xmm,xmm as 1c latency, so presumably it's correct that we get 3c latency per step in the merging dependency chain. Merge-masking uops unfortunately need the destination register to be ready as early as other inputs, so the merging part of the operation can't forward separately.
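If the merging chain is hard to picture, here is a scalar C model of it. This is not the instruction itself, just a sketch of what merge-masking does to each 64-bit element; the names merge_broadcast and pack4_model are made up, and the model treats every step as 4 elements wide for simplicity.

```c
#include <stdint.h>

/* Scalar model of `vpbroadcastq dst{k}, src` with merge-masking:
 * elements whose mask bit is set receive src, the others keep their old value. */
void merge_broadcast(uint64_t *dst, int nelem, unsigned mask, uint64_t src)
{
    for (int i = 0; i < nelem; i++) {
        if (mask & (1u << i))
            dst[i] = src;
    }
}

/* Models the 4-step chain for one YMM register: vmovq, then three merges.
 * Each merge reads the result of the previous step, hence the serial latency. */
void pack4_model(uint64_t out[4], uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    out[0] = a;                        /* vmovq: element 0 = a, rest zeroed */
    out[1] = out[2] = out[3] = 0;
    merge_broadcast(out, 4, 0x2, b);   /* k1 = 0b0010 */
    merge_broadcast(out, 4, 0x4, c);   /* k2 = 0b0100 */
    merge_broadcast(out, 4, 0x8, d);   /* k3 = 0b1000 */
}
```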
Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:
mov eax, 0b0010 ; = 2
kmovw k1, eax
KSHIFTLW k2, k1, 1
KSHIFTLW k3, k1, 2
; KSHIFTLW k4, k1, 3
; ...
KSHIFT runs only on port 5 (SKX), but so does KMOV; moving each mask from integer registers would just cost extra instructions to set up integer regs first.
It's actually ok if the upper bytes of the vector are filled with broadcasts, not zeros, so we could use 0b1110 / 0b1100 etc. for the masks. We eventually write all the elements. We could start with KXNOR k0, k0,k0 to generate a -1 and left-shift that, but that's 2 port5 uops vs. mov eax,2 / kmovw k1, eax being p0156 + p5.
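A quick scalar model shows why either family of masks works: one-hot masks shifted up from 0b0010, or "all higher elements" masks shifted up from -1 << 1. Again this is only a sketch with made-up names (merge_broadcast, pack4_with_base), modeling 16-bit masks like KSHIFTLW and treating every step as 4 elements wide.

```c
#include <stdint.h>

/* Scalar model of a merge-masked broadcast: set-bit elements take src. */
void merge_broadcast(uint64_t *dst, int nelem, unsigned mask, uint64_t src)
{
    for (int i = 0; i < nelem; i++) {
        if (mask & (1u << i))
            dst[i] = src;
    }
}

/* Builds a 4-element vector using masks derived from one base mask by
 * left shifts truncated to 16 bits (as KSHIFTLW would produce).
 * base = 0x0002 gives 0b0010, 0b0100, 0b1000;
 * base = 0xFFFE gives ...1110, ...1100, ...1000. */
void pack4_with_base(uint64_t out[4], const uint64_t in[4], uint16_t base)
{
    out[0] = in[0];                     /* vmovq step */
    out[1] = out[2] = out[3] = 0;
    for (int n = 1; n < 4; n++) {
        uint16_t k = (uint16_t)(base << (n - 1));
        /* With the wide masks, higher elements get a temporary copy
         * that a later step overwrites. */
        merge_broadcast(out, 4, k, in[n]);
    }
}
```

Both base masks give the same final vector, because every element above the current one is rewritten by a later merge.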
Without a mask register: (There's no kmov k1, imm, and loading from memory costs multiple uops, so as a one-off there's no 3-uop option using merge-masking. But in a loop if you can spare some mask regs, that appears to be far better.)
VPBROADCASTQ xmm1, rdx ; 1 uop p5 ; AVX512VL (ZMM1 for just AVX512F)
vmovq xmm0, rax ; 1 uop p5 ; AVX1
vpblendd xmm0, xmm0, xmm1, 0b1100 ; 1 uop p015 ; AVX2
; SKX: 3 uops: 2p5 + p015
; KNL: 3 uops: ? + ? + FP0/1
The only benefit here is that one of the 3 uops doesn't need port 5.
vmovsd xmm1, xmm1, xmm0 would also blend the two halves, but only runs on port 5 on recent Intel, unlike an integer immediate blend which runs on any vector ALU port.
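The broadcast + movq + blend data flow can be sketched with SSE2 intrinsics, using the movsd-style blend since it needs nothing beyond SSE2. The name pack_blend is made up for illustration. Without AVX512 the broadcast from an integer register is not a single instruction, so this only illustrates which element comes from where; the compiler chooses the actual instructions.

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */
#include <stdint.h>

/* Hypothetical helper: lo ends up in element 0, hi in element 1. */
__m128i pack_blend(uint64_t lo, uint64_t hi)
{
    __m128i vhi = _mm_set1_epi64x((long long)hi);    /* hi in both elements */
    __m128i vlo = _mm_cvtsi64_si128((long long)lo);  /* lo in element 0, rest zero */
    /* movsd-style blend: element 0 from vlo, element 1 from vhi. */
    __m128d r = _mm_move_sd(_mm_castsi128_pd(vhi), _mm_castsi128_pd(vlo));
    return _mm_castpd_si128(r);
}
```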
gcc likes to store/reload, which is not optimal on anything except in very rare port 5-bound situations where a large amount of latency doesn't matter. I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833, with more discussion of what might be optimal on various architectures for 32-bit or 64-bit elements.
I suggested the above vpbroadcastq replacement for insert with AVX512 on the first bug.
(If compiling _mm_set_epi64x, definitely use -mtune=haswell or something recent, to avoid the crappy tuning for the default -mtune=generic. Or use -march=native if your binaries will only run on the local machine.)
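For completeness, a minimal use of _mm_set_epi64x looks like this. The wrapper name pack_set is made up for illustration. Note the argument order: the high element comes first. Compiling it with something like gcc -O3 -mtune=haswell -S lets you inspect which of the sequences above the compiler actually picked.

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */
#include <stdint.h>

/* Hypothetical wrapper: lo ends up in element 0, hi in element 1. */
__m128i pack_set(uint64_t lo, uint64_t hi)
{
    /* _mm_set_epi64x takes the high element first, then the low element. */
    return _mm_set_epi64x((long long)hi, (long long)lo);
}
```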