128-bit values - From XMM registers to General Purpose registers

Published 2019-03-27 12:44

Question:

I have a couple of questions about moving XMM values to general purpose registers. All the questions I found on SO focus on the opposite direction, namely transferring values from GP registers to XMM.

  1. How can I move an XMM register value (128-bit) to two 64-bit general purpose registers?

    movq RAX, XMM1 ; bits 0-63
    mov? RCX, XMM1 ; bits 64-127
    
  2. Similarly, how can I move an XMM register value (128-bit) to four 32-bit general purpose registers?

    movd EAX, XMM1 ; bits 0-31
    mov? ECX, XMM1 ; bits 32-63
    mov? EDX, XMM1 ; bits 64-95
    mov? ESI, XMM1 ; bits 96-127
    

Answer 1:

You cannot move the upper bits of an XMM register into a general purpose register directly.
You'll have to follow a two-step process, which may or may not involve a roundtrip to memory or the destruction of a register.

in registers

movq rax,xmm0       ;lower 64 bits
movhlps xmm0,xmm0   ;move high 64 bits to low 64 bits.
movq rbx,xmm0       ;high 64 bits.
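
If you are writing C or C++ rather than hand-written asm, this in-register approach maps directly onto SSE2 intrinsics. A minimal sketch (the helper name is mine; compilers typically emit movq plus punpckhqdq or movhlps for it):

#include <immintrin.h>
#include <cstdint>

// Low half with movq, then bring the high qword down with an integer shuffle.
static inline void xmm_to_two_u64(__m128i v, uint64_t &lo, uint64_t &hi) {
    lo = (uint64_t)_mm_cvtsi128_si64(v);                         // movq rax, xmm0
    hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));  // high half to low, then movq
}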

via memory

movdqu [mem],xmm0
mov rax,[mem]
mov rbx,[mem+8]
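
The store/reload version is a one-liner with intrinsics as well; a sketch under the same assumptions (buffer and helper name are mine):

#include <immintrin.h>
#include <cstdint>

// One 16-byte store, two scalar reloads; store forwarding covers the latency.
static inline void xmm_to_two_u64_mem(__m128i v, uint64_t &lo, uint64_t &hi) {
    alignas(16) uint64_t buf[2];
    _mm_store_si128((__m128i *)buf, v);
    lo = buf[0];
    hi = buf[1];
}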

slow, but does not destroy xmm register

movq rax,xmm0
pextrq rbx,xmm0,1        ;3 cycle latency on Ryzen!
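
The pextrq form corresponds to the SSE4.1 intrinsic _mm_extract_epi64; a sketch, assuming SSE4.1 is enabled (e.g. -msse4.1):

#include <immintrin.h>
#include <cstdint>

// pextrq leaves the source vector untouched.
static inline void xmm_to_two_u64_extract(__m128i v, uint64_t &lo, uint64_t &hi) {
    lo = (uint64_t)_mm_cvtsi128_si64(v);     // movq
    hi = (uint64_t)_mm_extract_epi64(v, 1);  // pextrq
}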

For 32 bits, the code is similar:

in registers

movd eax,xmm0
psrldq xmm0,4         ;shift right by 4 bytes
movd ebx,xmm0
psrldq xmm0,4
movd ecx,xmm0
psrldq xmm0,4
movd edx,xmm0

via memory

movdqu [mem],xmm0
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]

slow, but does not destroy xmm register

movd eax,xmm0
pextrd ebx,xmm0,1        ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2       
pextrd edx,xmm0,3       

The 64-bit in-register (shuffle) variant can run in 2 cycles; the pextrq version takes at least 4. For 32-bit, the numbers are 4 and 10, respectively.
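
For completeness, the 32-bit variants map onto intrinsics the same way (the store/reload version just uses a uint32_t[4] buffer). A sketch with names of my own; the pextrd form again needs SSE4.1:

#include <immintrin.h>
#include <cstdint>

// psrldq shifts by whole bytes, so each step shifts by 4.
static inline void xmm_to_four_u32_shift(__m128i v, uint32_t out[4]) {
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);
    out[1] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4));
    out[2] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 8));
    out[3] = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 12));
}

// pextrd version: leaves v intact.
static inline void xmm_to_four_u32_extract(__m128i v, uint32_t out[4]) {
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);    // movd
    out[1] = (uint32_t)_mm_extract_epi32(v, 1);
    out[2] = (uint32_t)_mm_extract_epi32(v, 2);
    out[3] = (uint32_t)_mm_extract_epi32(v, 3);
}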



Answer 2:

On Intel SnB-family (including Skylake), shuffle+movq or movd has the same performance as a pextrq/d. It decodes to a shuffle uop and a movd uop, so this is not surprising.

On AMD Ryzen, pextrq apparently has 1 cycle lower latency than shuffle + movq. pextrd/q is 3c latency, and so is movd/q, according to Agner Fog's tables. This is a neat trick (if it's accurate), since pextrd/q does decode to 2 uops (vs. 1 for movq).

Since shuffles have non-zero latency, shuffle+movq is always strictly worse than pextrq on Ryzen (except for possible front-end decode / uop-cache effects).

The major downside to a pure ALU strategy for extracting all elements is throughput: it takes a lot of ALU uops, and most CPUs only have one execution unit / port that can move data from XMM to integer. Store/reload has higher latency for the first element, but better throughput (because modern CPUs can do 2 loads per cycle). If the surrounding code is bottlenecked by ALU throughput, a store/reload strategy could be good. Maybe do the low element with a movd or movq so out-of-order execution can get started on whatever uses it while the rest of the vector data is going through store forwarding.
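
As a rough sketch of that hybrid idea (helper name and layout are mine, not from the answer): grab element 0 with a movd so dependent work can start right away, and pull the remaining elements back through one store:

#include <immintrin.h>
#include <cstdint>

static inline void xmm_to_four_u32_hybrid(__m128i v, uint32_t out[4]) {
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);  // movd: element 0 is ready immediately
    alignas(16) uint32_t buf[4];
    _mm_store_si128((__m128i *)buf, v);       // single 16-byte store
    out[1] = buf[1];                          // scalar reloads: good throughput,
    out[2] = buf[2];                          // higher latency via store forwarding
    out[3] = buf[3];
}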


Another option worth considering (besides what Johan mentioned) for extracting 32-bit elements to integer registers is to do some of the "shuffling" with integer shifts:

movq rax,xmm0
# use eax now, before destroying it
shr  rax,32    

pextrq rcx,xmm0,1
# use ecx now, before destroying it
shr  rcx, 32

shr can run on p0 or p6 in Intel Haswell/Skylake. p6 has no vector ALUs, so this sequence is quite good if you want low latency but also low pressure on vector ALUs.


Or if you want to keep them around:

movq rax,xmm0
rorx rbx, rax, 32    # BMI2
# shld rbx, rax, 32  # alternative that has a false dep on rbx
# eax=xmm0[0], ebx=xmm0[1]

pextrq rdx,xmm0,1
mov  ecx, edx     # the "normal" way, if you don't want rorx or shld
shr  rdx, 32
# ecx=xmm0[2], edx=xmm0[3]
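
The same movq + integer-shift trick is easy to express in C, letting the compiler pick shr/rorx/shld as it sees fit; a sketch (helper name is mine), assuming SSE4.1 for the pextrq half:

#include <immintrin.h>
#include <cstdint>

static inline void xmm_to_four_u32_via_shifts(__m128i v, uint32_t out[4]) {
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);     // movq: elements 0 and 1
    uint64_t hi = (uint64_t)_mm_extract_epi64(v, 1);  // pextrq: elements 2 and 3
    out[0] = (uint32_t)lo;
    out[1] = (uint32_t)(lo >> 32);   // scalar shift, no extra vector-ALU pressure
    out[2] = (uint32_t)hi;
    out[3] = (uint32_t)(hi >> 32);
}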


Answer 3:

The following handles both set and get, and seems to work (it uses GCC inline asm in AT&T syntax):

#include <cstdint>
#include <iostream>

int main() {
    uint64_t lo1(111111111111L);
    uint64_t hi1(222222222222L);
    uint64_t lo2, hi2;

    asm volatile (
            "movq       %3,     %%xmm0      ; " // set high 64 bits
            "pslldq     $8,     %%xmm0      ; " // shift left 64 bits
            "movsd      %2,     %%xmm0      ; " // set low 64 bits
                                                // operate on 128 bit register
            "movq       %%xmm0, %0          ; " // get low 64 bits
            "movhlps    %%xmm0, %%xmm0      ; " // move high to low
            "movq       %%xmm0, %1          ; " // get high 64 bits
            : "=x"(lo2), "=x"(hi2)
            : "x"(lo1), "x"(hi1)
            : "%xmm0"
    );

    std::cout << "lo1: [" << lo1 << "]" << std::endl;
    std::cout << "hi1: [" << hi1 << "]" << std::endl;
    std::cout << "lo2: [" << lo2 << "]" << std::endl;
    std::cout << "hi2: [" << hi2 << "]" << std::endl;

    return 0;
}
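
For comparison, the same set/get round trip can be done without inline asm at all, using SSE2 intrinsics; a minimal sketch:

#include <immintrin.h>
#include <cstdint>
#include <iostream>

int main() {
    uint64_t lo1 = 111111111111ULL;
    uint64_t hi1 = 222222222222ULL;

    __m128i v = _mm_set_epi64x((int64_t)hi1, (int64_t)lo1);  // set: hi1 goes in the upper qword

    // ... operate on the 128-bit value here ...

    uint64_t lo2 = (uint64_t)_mm_cvtsi128_si64(v);                          // get low 64 bits
    uint64_t hi2 = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));   // get high 64 bits

    std::cout << "lo2: [" << lo2 << "] hi2: [" << hi2 << "]" << std::endl;
    return 0;
}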


Tags: assembly x86 sse