How to move 128-bit immediates to XMM registers

Posted 2019-02-07 22:33

Question:

There already is a question on this, but it was closed as "ambiguous" so I'm opening a new one - I've found the answer, maybe it will help others too.

The question is: how do you write a sequence of assembly code to initialize an XMM register with a 128-bit immediate (constant) value?

Answer 1:

Just wanted to add that you can read about generating various constants in assembly in Agner Fog's manual "Optimizing subroutines in assembly language", under "Generating constants" (section 13.4, page 121).



Answer 2:

You can do it like this, with just one movaps instruction:

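.section .rodata    # put your constants in the read-only data section
.p2align 4          # align to 16 = 1<<4
LC0:
        .long   1082130432      # 0x40800000 = 4.0f (lowest element)
        .long   1077936128      # 0x40400000 = 3.0f
        .long   1073741824      # 0x40000000 = 2.0f
        .long   1065353216      # 0x3f800000 = 1.0f (highest element)

.text
foo:
        movaps  LC0(%rip), %xmm0    # one aligned 16-byte load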

Loading the constant from memory is usually preferable to embedding it in the instruction stream, mainly because of how many instructions the embedded version takes. For an arbitrary constant that can't be generated from all-ones with a couple of shifts, that means several extra uops for the CPU to execute.
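For readers working in C rather than assembly, the same idea can be sketched with SSE intrinsics. This is only an illustration: the names kVec and load_vec_constant are made up for this example, and the compiler, not the source, decides which load instruction is actually emitted.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps */

/* Hypothetical table holding the same values as LC0: 4.0f, 3.0f, 2.0f, 1.0f.
   _Alignas(16) plays the role of .p2align 4. */
static _Alignas(16) const float kVec[4] = {4.0f, 3.0f, 2.0f, 1.0f};

/* Loads the whole 16-byte constant with one aligned load. */
__m128 load_vec_constant(void)
{
    /* _mm_load_ps requires a 16-byte-aligned address, like movaps. */
    return _mm_load_ps(kVec);
}
```

With optimization enabled, a compiler will typically turn this into a single load from read-only data, much like the hand-written version.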

If it's easier, you can put constants right before or after a function that you jit-compile, instead of in a separate section. But since CPUs have split L1d / L1i caches and TLBs, it's generally best to group constants together separate from instructions.

If both halves of your constant are the same, you can broadcast-load it with SSE3 movddup (m64), %xmm0.
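A rough C counterpart of the broadcast load uses the _mm_loaddup_pd intrinsic. The function name broadcast_half is hypothetical, and the target attribute is a GCC/Clang extension assumed here so that the function can use SSE3 without a global -msse3 flag.

```c
#include <pmmintrin.h>  /* SSE3: _mm_loaddup_pd */

/* Assumption: GCC or Clang; the attribute enables SSE3 for this function only. */
__attribute__((target("sse3")))
__m128d broadcast_half(const double *half)
{
    /* Reads 8 bytes and copies them into both 64-bit lanes (movddup). */
    return _mm_loaddup_pd(half);
}
```

Only 8 bytes of constant data are needed, and the source does not have to be 16-byte aligned.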



Answer 3:

As one of the 10000 ways to do it, use SSE4.1 pinsrq:

mov    rax, <first_half>   ; low 64 bits of the constant
movq   xmm0, rax           ; better than pinsrq xmm0,rax,0 for performance and code-size

mov    rax, <second_half>  ; high 64 bits of the constant
pinsrq xmm0, rax, 1        ; insert into the upper qword
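The same two-step construction can be sketched in C with intrinsics. The function name make_u128_pinsrq is invented for this example, and the target attribute is a GCC/Clang extension assumed here so that SSE4.1 is enabled for this function only.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_insert_epi64 */
#include <stdint.h>

/* Assumption: GCC or Clang on x86-64. */
__attribute__((target("sse4.1")))
__m128i make_u128_pinsrq(uint64_t low, uint64_t high)
{
    /* movq: put the low half in lane 0 and zero the upper lane. */
    __m128i v = _mm_cvtsi64_si128((long long)low);
    /* pinsrq with index 1: overwrite the upper lane with the high half. */
    return _mm_insert_epi64(v, (long long)high, 1);
}
```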


Answer 4:

There are multiple ways of embedding constants in the instruction stream:

  1. by using immediate operands
  2. by loading from PC-relative addresses

So while there is no way to load an immediate directly into an XMM register, it's possible to do a PC-relative load (in 64-bit mode) from a value stored right next to the code that uses it. That creates something like:

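.p2align 4      # 16-byte alignment for movdqa; plain ".align 4" means only 4 bytes on some targets (e.g. x86 ELF)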
.val:
    .long   0x12345678
    .long   0x9abcdef0
    .long   0xfedbca98
    .long   0x76543210
func:
     movdqa .val(%rip), %xmm0

When you disassemble:

0000000000000000 <.val>:
   0:   78 56 34 12 f0 de bc 9a
   8:   98 ca db fe 10 32 54 76

0000000000000010 <func>:
  10:   66 0f 6f 05 e8 ff ff    movdqa -0x18(%rip),%xmm0        # 0 <.val>
  17:   ff

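which is very compact: 24 bytes in total (16 bytes of data plus an 8-byte movdqa, i.e. the opcode bytes 66 0f 6f 05 followed by a 4-byte displacement).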

Another option is to construct the value on the stack and load it from there. In 32-bit x86, where you don't have %rip-relative memory access, you can still do that in 25 bytes (four 5-byte pushes plus a 5-byte movdqa), assuming the stack pointer is 16-byte aligned on entry; otherwise an unaligned load is required:

00000000 :
   0:   68 78 56 34 12          push   $0x12345678
   5:   68 f0 de bc 9a          push   $0x9abcdef0
   a:   68 98 ca db fe          push   $0xfedbca98
   f:   68 10 32 54 76          push   $0x76543210
  14:   66 0f 6f 04 24          movdqa (%esp),%xmm0

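In 64-bit mode the same approach takes 27 bytes. Note that movdqa needs %rsp to be 16-byte aligned after the two pushes. The ABI guarantees that alignment before a call, so right at function entry %rsp is 8 bytes off a 16-byte boundary (the return address has just been pushed); there you would need movdqu, which has the same length, or an adjusted stack pointer: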

0000000000000000 :
   0:   48 b8 f0 de bc 9a 78 56 34 12   movabs $0x123456789abcdef0,%rax
   a:   50                              push   %rax
   b:   48 b8 10 32 54 76 98 ba dc fe   movabs $0xfedcba9876543210,%rax
  15:   50                              push   %rax
  16:   66 0f 6f 04 24                  movdqa (%rsp),%xmm0
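In C, the "build it on the stack, then load it" pattern could look roughly like the sketch below. The function name build_on_stack is hypothetical, and an optimizing compiler is free to replace the store and reload with something shorter.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128 */
#include <stdint.h>

__m128i build_on_stack(uint64_t low, uint64_t high)
{
    /* Local 16-byte-aligned buffer, standing in for the pushed values. */
    _Alignas(16) uint64_t buf[2];
    buf[0] = low;   /* lower address = low half */
    buf[1] = high;  /* higher address = high half */
    /* Aligned 16-byte load (movdqa-style); valid because buf is aligned. */
    return _mm_load_si128((const __m128i *)buf);
}
```

Letting the compiler place and align the buffer avoids having to reason by hand about where the stack pointer sits relative to a 16-byte boundary.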

If you compare any of these with the MOVLHPS version, you'll notice that the MOVLHPS version is the longest:

0000000000000000 :
   0:   48 b8 f0 de bc 9a 78 56 34 12   movabs $0x123456789abcdef0,%rax
   a:   66 48 0f 6e c0                  movq   %rax,%xmm0
   f:   48 b8 10 32 54 76 98 ba dc fe   movabs $0xfedcba9876543210,%rax
  19:   66 48 0f 6e c8                  movq   %rax,%xmm1
  1e:   0f 16 c1                        movlhps %xmm1,%xmm0

at 33 Bytes.

The other advantage of loading directly from instruction memory is that the movdqa doesn't depend on anything previous. Most likely, the first version, as given by @Paul R, is the fastest you can get.



Answer 5:

The best solution (especially if you want to stick to SSE2, i.e. to avoid using AVX) is to initialize two registers (say, xmm0 and xmm1) with the two 64-bit halves of your immediate value, and then do MOVLHPS xmm0,xmm1. To initialize a 64-bit value, the easiest solution is to use a general-purpose register (say, RAX) and then use MOVQ to transfer its value to the XMM register. So the sequence would be something like this:

MOV RAX, <first_half>
MOVQ XMM0, RAX
MOV RAX, <second_half>
MOVQ XMM1, RAX
MOVLHPS XMM0,XMM1
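For comparison, here is a minimal C sketch of the same MOVQ plus MOVLHPS sequence using SSE/SSE2 intrinsics. The function name combine_halves is made up for this example, and the compiler may pick different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in the SSE header for _mm_movelh_ps) */
#include <stdint.h>

__m128i combine_halves(uint64_t low, uint64_t high)
{
    /* movq: each half goes into lane 0 of its own register. */
    __m128i lo = _mm_cvtsi64_si128((long long)low);
    __m128i hi = _mm_cvtsi64_si128((long long)high);
    /* movlhps: copy the low 64 bits of hi into the high 64 bits of lo.
       The casts only reinterpret the register type; they emit no code. */
    return _mm_castps_si128(
        _mm_movelh_ps(_mm_castsi128_ps(lo), _mm_castsi128_ps(hi)));
}
```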