Let's say you have values in rax and rdx that you want to load into an xmm register.

One way would be:

    movq   xmm0, rax
    pinsrq xmm0, rdx, 1

It's pretty slow though! Is there a better way?
You're not going to do better for latency or uop count on recent Intel or AMD (I mostly looked at Agner Fog's tables for Ryzen / Skylake).
movq+movq+punpcklqdq is also 3 uops, for the same port(s).

On Intel / AMD, storing the GP registers to a temporary location and reloading them with a 16-byte read may be worth considering for throughput if surrounding code bottlenecks on the ALU port for integer->vector, which is port 5 for recent Intel.
On Intel, pinsrq x,r,imm is 2 uops for port 5, and movq xmm, r64 is also 1 uop for port 5. movhps xmm, [mem] can micro-fuse the load, but it still needs a port 5 ALU uop. So movq xmm0, rax / mov [rsp-8], rdx / movhps xmm0, [rsp-8] is 3 fused-domain uops, 2 of them needing port 5 on recent Intel. The store-forwarding latency makes this significantly higher latency than an insert.

Storing both GP regs with store / store / movdqa (long store-forwarding stall from reading the two narrower stores with a larger load) is also 3 uops, but is the only reasonable sequence that avoids any port 5 uops. The ~15 cycles of latency is so much that out-of-order execution could easily have trouble hiding it.

For YMM and/or narrower elements, stores + reload is more worth considering because you amortize the stall over more stores / it saves you more shuffle uops. But it still shouldn't be your go-to strategy for 32-bit elements.
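Collected as code, the alternatives above look something like this (Intel syntax; a sketch, not a drop-in: the scratch slots below rsp assume the x86-64 SysV red zone is usable, and the movdqa reload assumes that slot happens to be 16-byte aligned):

    ; Option 1: ALU-only, 3 uops, all port 5 on recent Intel
    movq       xmm0, rax
    movq       xmm1, rdx
    punpcklqdq xmm0, xmm1        ; low qword = rax, high qword = rdx

    ; Option 2: one store + movhps reload, 3 fused-domain uops, 2 need port 5
    movq       xmm0, rax
    mov        [rsp-8], rdx
    movhps     xmm0, [rsp-8]     ; store-forwarding latency on the high half

    ; Option 3: store both + vector reload, 3 uops, none need port 5,
    ; but a ~15c store-forwarding stall
    mov        [rsp-16], rax
    mov        [rsp-8], rdx
    movdqa     xmm0, [rsp-16]    ; use movdqu if alignment isn't guaranteed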
For narrower elements, it would be nice if there were a single-uop way of packing 2 narrow integers into a 64-bit integer register, to set up for wider transfers to XMM regs. But there isn't: see Packing two DWORDs into a QWORD to save store bandwidth. shld is 1 uop on Intel SnB-family but needs one of the inputs at the top of a register. x86 has pretty weak bitfield insert/extract instructions compared to PowerPC or ARM, requiring multiple instructions per merge (other than store/reload, and store throughput of 1 per clock can easily become a bottleneck).

AVX512F can broadcast to a vector from an integer reg, and merge-masking allows single-uop inserts.
According to the spreadsheet from http://instlatx64.atw.hu/ (taking uop data from IACA), it only costs 1 port-5 uop to broadcast any width of integer register to an x/y/zmm vector on Skylake-AVX512. Agner doesn't seem to have tested integer source regs on KNL, but a similar VPBROADCASTMB2Q v, k (mask register source) is 1 uop.

With a mask register already set up, it's only 2 uops total.
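A sketch of that 2-uop sequence, assuming k1 = 0b0010 is already set up (mask setup is shown further down):

    vmovq        xmm0, rax        ; element 0 = rax; unmasked, so it can use a shorter VEX encoding
    vpbroadcastq xmm0{k1}, rdx    ; merge-masked: only element 1 is written (= rdx)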
I think merge-masking is "free" even for ALU uops. Note that we do the VMOVQ first so we can avoid a longer EVEX encoding for it. But if you have 0001 in a mask reg instead of 0010, blend it into an unmasked broadcast with vmovq xmm0{k1}, rax.

With more mask registers set up, we can do 1 reg per uop.
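For example, filling a YMM from four integer registers; rdi and rsi below are hypothetical stand-ins for wherever the other two values live, and the masks are as set up further down:

    vmovq        xmm0, rax        ; element 0
    vpbroadcastq xmm0{k1}, rdx    ; element 1, k1 = 0b0010
    vpbroadcastq ymm0{k2}, rdi    ; element 2, k2 = 0b0100 (rdi: hypothetical source)
    vpbroadcastq ymm0{k3}, rsi    ; element 3, k3 = 0b1000 (rsi: hypothetical source)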
(For a full ZMM vector, maybe start a 2nd dep chain and vinserti64x4 to combine 256-bit halves. That also means only 3 k registers instead of 7. It costs 1 extra shuffle uop, but unless there's some software pipelining, OoO exec might have trouble hiding the latency of 7 merges = 21c before you do anything with your vector.)
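A sketch of that two-chain ZMM version; r8-r13 are hypothetical stand-ins for the remaining six source registers, and only k1-k3 are needed:

    ; chain 1: low 256 bits
    vmovq        xmm0, rax
    vpbroadcastq xmm0{k1}, rdx    ; k1 = 0b0010
    vpbroadcastq ymm0{k2}, r8     ; k2 = 0b0100
    vpbroadcastq ymm0{k3}, r9     ; k3 = 0b1000

    ; chain 2: high 256 bits, built independently so the merges can overlap
    vmovq        xmm1, r10
    vpbroadcastq xmm1{k1}, r11
    vpbroadcastq ymm1{k2}, r12
    vpbroadcastq ymm1{k3}, r13

    ; combine the halves: 1 extra shuffle uop
    vinserti64x4 zmm0, zmm0, ymm1, 1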
Intel's listed latency for vpbroadcastq on SKX is still 3c even when the destination is only xmm, according to the Instlatx64 spreadsheet (http://instlatx64.atw.hu/), which quotes that and other sources. The same document does list vpbroadcastq xmm, xmm as 1c latency, so presumably it's correct that we get 3c latency per step in the merging dependency chain. Merge-masking uops unfortunately need the destination register to be ready as early as the other inputs, so the merging part of the operation can't forward separately.
Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT, as sketched below. KSHIFT runs only on port 5 (SKX), but so does KMOV; moving each mask from an integer register would just cost extra instructions to set up integer regs first.
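A minimal sketch of that setup:

    mov      eax, 0b0010          ; p0156
    kmovw    k1, eax              ; p5: k1 = 0b0010
    kshiftlw k2, k1, 1            ; p5: k2 = 0b0100
    kshiftlw k3, k1, 2            ; p5: k3 = 0b1000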
It's actually ok if the upper bytes of the vector are filled with broadcasts, not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.
We eventually write all the elements. We could start with KXNOR k0, k0, k0 to generate a -1 and left-shift that, but that's 2 port-5 uops vs. mov eax, 2 / kmovw k1, eax being p0156 + p5.
Without a mask register, there's a 3-uop sequence, shown below. (There's no kmov k1, imm, and loading from memory costs multiple uops, so as a one-off there's no 3-uop option using merge-masking. But in a loop, if you can spare some mask regs, that appears to be far better.) The only benefit of the no-mask sequence is that one of the 3 uops doesn't need port 5.
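A sketch of one such sequence, using the AVX512F integer-source broadcast from above to get rdx into the upper lane; the exact choice of broadcast + blend and the blend immediate are my reconstruction, so verify before relying on it:

    vmovq        xmm0, rax                   ; 1 uop, port 5
    vpbroadcastq xmm1, rdx                   ; 1 uop, port 5: rdx in both qwords
    vpblendd     xmm0, xmm0, xmm1, 0b1100    ; 1 uop, any vector ALU port:
                                             ; dwords 2-3 (high qword) from xmm1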
vmovsd xmm1, xmm1, xmm0 would also blend the two halves, but it only runs on port 5 on recent Intel, unlike an integer immediate blend which runs on any vector ALU port.

More discussion about integer -> vector strategies:
gcc likes to store/reload, which is not optimal on anything except in very rare port 5-bound situations where a large amount of latency doesn't matter. I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833, with more discussion of what might be optimal on various architectures for 32-bit or 64-bit elements.
I suggested the above vpbroadcastq replacement for the insert with AVX512 on the first bug.

(If compiling _mm_set_epi64x, definitely use -mtune=haswell or something recent to avoid the crappy tuning for the default -mtune=generic. Or use -march=native if your binaries will only run on the local machine.)