I have a couple of questions related to moving XMM values to general purpose registers. All the questions found on SO focus on the opposite, namely transferring values in GP registers to XMM.
How can I move an XMM register value (128-bit) to two 64-bit general purpose registers?
movq RAX, XMM1 ; 0th bit to 63rd bit
mov? RCX, XMM1 ; 64th bit to 127th bit
Similarly, how can I move an XMM register value (128-bit) to four 32-bit general purpose registers?
movd EAX, XMM1 ; 0th bit to 31st bit
mov? ECX, XMM1 ; 32nd bit to 63rd bit
mov? EDX, XMM1 ; 64th bit to 95th bit
mov? ESI, XMM1 ; 96th bit to 127th bit
You cannot move the upper bits of an XMM register into a general purpose register directly.
You'll have to follow a two-step process, which may or may not involve a roundtrip to memory or the destruction of a register.
in registers
movq rax,xmm0 ;lower 64 bits
movhlps xmm0,xmm0 ;move high 64 bits to low 64 bits.
movq rbx,xmm0 ;high 64 bits.
via memory
movdqu [mem],xmm0
mov rax,[mem]
mov rbx,[mem+8]
slow, but does not destroy the xmm register
movq rax,xmm0
pextrq rbx,xmm0,1 ;3 cycle latency on Ryzen!
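If you are writing C++ rather than hand-written asm, the same three approaches map directly onto SSE2/SSE4.1 intrinsics. A minimal sketch (the helper names are mine, and the compiler is of course free to emit a slightly different instruction sequence):

#include <immintrin.h>
#include <cstdint>

// shuffle + movq: _mm_unpackhi_epi64 plays the role of movhlps
void split_shuffle(__m128i v, uint64_t &lo, uint64_t &hi) {
    lo = (uint64_t)_mm_cvtsi128_si64(v);                        // movq rax,xmm0
    hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v)); // high half to low, then movq
}

// via memory: one 128-bit store, two 64-bit reloads
void split_memory(__m128i v, uint64_t &lo, uint64_t &hi) {
    uint64_t buf[2];
    _mm_storeu_si128((__m128i *)buf, v); // movdqu [mem],xmm0
    lo = buf[0];
    hi = buf[1];
}

// pextrq (needs SSE4.1): leaves v intact, no scratch register
void split_pextrq(__m128i v, uint64_t &lo, uint64_t &hi) {
    lo = (uint64_t)_mm_cvtsi128_si64(v);    // movq rax,xmm0
    hi = (uint64_t)_mm_extract_epi64(v, 1); // pextrq rbx,xmm0,1
}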
For 32 bits, the code is similar:
in registers
movd eax,xmm0
psrldq xmm0,4 ;shift 4 bytes to the right
movd ebx,xmm0
psrldq xmm0,4
movd ecx,xmm0
psrldq xmm0,4
movd edx,xmm0
via memory
movdqu [mem],xmm0
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]
slow, but does not destroy the xmm register
movd eax,xmm0
pextrd ebx,xmm0,1 ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2
pextrd edx,xmm0,3
The 64-bit shift variant can run in 2 cycles. The pextrq version takes 4 minimum. For 32-bit, the numbers are 4 and 10, respectively.
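The 32-bit variants translate to intrinsics the same way; another sketch under the same assumptions as above (note that with intrinsics the shift version reads from the original value each time, so nothing gets destroyed):

#include <immintrin.h>
#include <cstdint>

// in registers: psrldq byte-shifts feed the movd's
void split4_shift(__m128i v, uint32_t out[4]) {
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);                    // movd eax,xmm0
    out[1] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4)); // psrldq by 4 bytes, movd
    out[2] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 8));
    out[3] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 12));
}

// via memory: a single 128-bit store covers all four elements
void split4_memory(__m128i v, uint32_t out[4]) {
    _mm_storeu_si128((__m128i *)out, v); // movdqu [mem],xmm0
}

// pextrd (needs SSE4.1)
void split4_pextrd(__m128i v, uint32_t out[4]) {
    out[0] = (uint32_t)_mm_cvtsi128_si32(v); // movd eax,xmm0
    out[1] = (uint32_t)_mm_extract_epi32(v, 1);
    out[2] = (uint32_t)_mm_extract_epi32(v, 2);
    out[3] = (uint32_t)_mm_extract_epi32(v, 3);
}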
On Intel SnB-family (including Skylake), shuffle+movq or movd has the same performance as a pextrq/d. It decodes to a shuffle uop and a movd uop, so this is not surprising.
On AMD Ryzen, pextrq apparently has 1 cycle lower latency than shuffle + movq. pextrd/q is 3c latency, and so is movd/q, according to Agner Fog's tables. This is a neat trick (if it's accurate), since pextrd/q does decode to 2 uops (vs. 1 for movq).
Since shuffles have non-zero latency, shuffle+movq is always strictly worse than pextrq on Ryzen (except for possible front-end decode / uop-cache effects).
The major downside to a pure ALU strategy for extracting all elements is throughput: it takes a lot of ALU uops, and most CPUs only have one execution unit / port that can move data from XMM to integer. Store/reload has higher latency for the first element, but better throughput (because modern CPUs can do 2 loads per cycle). If the surrounding code is bottlenecked by ALU throughput, a store/reload strategy could be good. Maybe do the low element with a movd or movq so out-of-order execution can get started on whatever uses it while the rest of the vector data is going through store forwarding.
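That hybrid is easy to express with intrinsics; a sketch (the helper name is mine, and a real compiler may fold or reorder these operations):

#include <immintrin.h>
#include <cstdint>

// Low element via movd so dependent work can start right away;
// the remaining elements come back through store forwarding.
void split4_hybrid(__m128i v, uint32_t out[4]) {
    out[0] = (uint32_t)_mm_cvtsi128_si32(v); // movd: low element, ALU path
    uint32_t buf[4];
    _mm_storeu_si128((__m128i *)buf, v);     // one 128-bit store
    out[1] = buf[1];                         // reloads: up to 2 loads per cycle
    out[2] = buf[2];
    out[3] = buf[3];
}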
Another option worth considering (besides what Johan mentioned) for extracting 32-bit elements to integer registers is to do some of the "shuffling" with integer shifts:
movq rax,xmm0
# use eax now, before destroying it
shr rax,32
pextrq rcx,xmm0,1
# use ecx now, before destroying it
shr rcx, 32
shr can run on p0 or p6 in Intel Haswell/Skylake. p6 has no vector ALUs, so this sequence is quite good if you want low latency but also low pressure on vector ALUs.
Or if you want to keep them around:
movq rax,xmm0
rorx rbx, rax, 32 # BMI2
# shld rbx, rax, 32 # alternative that has a false dep on rbx
# eax=xmm0[0], ebx=xmm0[1]
pextrq rdx,xmm0,1
mov ecx, edx # the "normal" way, if you don't want rorx or shld
shr rdx, 32
# ecx=xmm0[2], edx=xmm0[3]
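In C++ the whole trick reduces to two 64-bit extracts plus plain integer shifts, which the compiler lowers to scalar-ALU instructions; a sketch under the same assumptions as the earlier ones:

#include <immintrin.h>
#include <cstdint>

// Two 64-bit extracts, then split each half with integer shifts
// (scalar shr runs on p0/p6, away from the vector ALUs).
void split4_intshift(__m128i v, uint32_t out[4]) {
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);    // movq rax,xmm0
    uint64_t hi = (uint64_t)_mm_extract_epi64(v, 1); // pextrq rcx,xmm0,1
    out[0] = (uint32_t)lo;
    out[1] = (uint32_t)(lo >> 32);                   // shr rax,32
    out[2] = (uint32_t)hi;
    out[3] = (uint32_t)(hi >> 32);                   // shr rcx,32
}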
The following handles both get and set and seems to work (AT&T syntax):
#include <cstdint>
#include <iostream>

int main() {
    uint64_t lo1(111111111111L);
    uint64_t hi1(222222222222L);
    uint64_t lo2, hi2;

    asm volatile (
        "movq    %3, %%xmm0     ;"  // set high 64 bits
        "pslldq  $8, %%xmm0     ;"  // shift left 64 bits
        "movsd   %2, %%xmm0     ;"  // set low 64 bits
                                    // operate on 128 bit register
        "movq    %%xmm0, %0     ;"  // get low 64 bits
        "movhlps %%xmm0, %%xmm0 ;"  // move high to low
        "movq    %%xmm0, %1     ;"  // get high 64 bits
        : "=x"(lo2), "=x"(hi2)      // outputs
        : "x"(lo1), "x"(hi1)        // inputs
        : "%xmm0"                   // clobbered register
    );

    std::cout << "lo1: [" << lo1 << "]" << std::endl;
    std::cout << "hi1: [" << hi1 << "]" << std::endl;
    std::cout << "lo2: [" << lo2 << "]" << std::endl;
    std::cout << "hi2: [" << hi2 << "]" << std::endl;
    return 0;
}
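For reference, the same set/get round trip can also be written without inline asm, using the intrinsics discussed above; a sketch, assuming SSE2 and a 64-bit target:

#include <emmintrin.h>
#include <cstdint>
#include <iostream>

int main() {
    uint64_t lo1 = 111111111111ULL;
    uint64_t hi1 = 222222222222ULL;

    __m128i v = _mm_set_epi64x((int64_t)hi1, (int64_t)lo1); // set high and low halves

    // operate on 128 bit register

    uint64_t lo2 = (uint64_t)_mm_cvtsi128_si64(v);                        // get low 64 bits
    uint64_t hi2 = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v)); // get high 64 bits

    std::cout << "lo2: [" << lo2 << "]" << std::endl;
    std::cout << "hi2: [" << hi2 << "]" << std::endl;
    return 0;
}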