There already is a question on this, but it was closed as "ambiguous" so I'm opening a new one - I've found the answer, maybe it will help others too.
The question is: how do you write a sequence of assembly code to initialize an XMM register with a 128-bit immediate (constant) value?
Just wanted to add that one can read about generating various constants using assembly in Agner Fog's manual Optimizing subroutines in assembly language, Generating constants, section 13.4, page 121.
You can do it like this, with just one movaps instruction:
.section .rodata # put your constants in the read-only data section
.p2align 4 # align to 16 = 1<<4
LC0:
.long 1082130432 # 0x40800000 = 4.0f
.long 1077936128 # 0x40400000 = 3.0f
.long 1073741824 # 0x40000000 = 2.0f
.long 1065353216 # 0x3f800000 = 1.0f
.text
foo:
movaps LC0(%rip), %xmm0
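One way to produce such .long values from the floats you actually want is a small helper script. The following is a minimal Python sketch; the function names float_to_long and long_directives are made up for illustration and are not part of any toolchain.

```python
import struct

def float_to_long(value):
    """Return the 32-bit IEEE-754 bit pattern of `value` as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def long_directives(values):
    """Format one GAS `.long` line per float, lowest address first."""
    return [f".long {float_to_long(v)}" for v in values]

if __name__ == "__main__":
    print("\n".join(long_directives([4.0, 3.0, 2.0, 1.0])))
```

Run on 4.0, 3.0, 2.0, 1.0, it reproduces the four .long lines of LC0, so after the movaps the lowest lane of xmm0 holds 4.0 and the highest holds 1.0.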
Loading the constant with a data load is usually preferable to embedding it in the instruction stream, especially given how many instructions the latter takes: for an arbitrary constant that can't be generated from all-ones with a couple of shifts, that's several extra uops for the CPU to execute.
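To make "generated from all-ones with a couple of shifts" concrete: PCMPEQD xmm0, xmm0 sets every bit of the register, and a PSLLD / PSRLD pair then carves a bit pattern out of each 32-bit lane. The Python sketch below only models those per-lane logical shifts (for shift counts 0 to 31) to show which constants are reachable this way; the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def pslld(lane, count):
    """Logical left shift of one 32-bit lane, bits shifted out are dropped."""
    return (lane << count) & MASK32

def psrld(lane, count):
    """Logical right shift of one 32-bit lane, zeros shifted in."""
    return (lane & MASK32) >> count

ALL_ONES = MASK32  # what PCMPEQD xmm0, xmm0 leaves in every lane

ONE_F32 = psrld(pslld(ALL_ONES, 25), 2)  # 0x3f800000, the bit pattern of 1.0f
SIGN_MASK = pslld(ALL_ONES, 31)          # 0x80000000, float sign bit
ABS_MASK = psrld(ALL_ONES, 1)            # 0x7fffffff, clears the sign bit

if __name__ == "__main__":
    for name, bits in [("1.0f", ONE_F32), ("sign", SIGN_MASK), ("abs", ABS_MASK)]:
        print(f"{name}: {bits:#010x}")
```

Every lane receives the same value with this trick, which is why a vector such as 4.0, 3.0, 2.0, 1.0 is better loaded from memory.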
If it's easier, you can put constants right before or after a function that you jit-compile, instead of in a separate section. But since CPUs have split L1d / L1i caches and TLBs, it's generally best to group constants together separate from instructions.
If both halves of your constant are the same, you can broadcast-load it with SSE3 movddup (m64), %xmm0.
As one of the 10000 ways to do it, use SSE4.1 pinsrq:
mov rax, <first_half> ; low 64 bits of the constant
movq xmm0, rax ; better than pinsrq xmm0,rax,0 for performance and code-size
mov rax, <second_half> ; high 64 bits of the constant
pinsrq xmm0, rax, 1
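Working out the two 64-bit halves by hand is error-prone. As a rough sketch (the helper names split_u128 and pinsrq_sequence and the sample constant are assumptions for illustration), a few lines of Python can split a 128-bit constant and print the sequence above with the immediates filled in:

```python
MASK64 = (1 << 64) - 1

def split_u128(value):
    """Split a 128-bit integer into (low, high) 64-bit halves."""
    if not 0 <= value < (1 << 128):
        raise ValueError("constant does not fit in 128 bits")
    return value & MASK64, value >> 64

def pinsrq_sequence(value):
    """Intel-syntax lines that build `value` in xmm0 via rax (needs SSE4.1)."""
    low, high = split_u128(value)
    return [
        f"mov rax, {low:#018x}",
        "movq xmm0, rax",
        f"mov rax, {high:#018x}",
        "pinsrq xmm0, rax, 1",
    ]

if __name__ == "__main__":
    print("\n".join(pinsrq_sequence(0xfedcba9876543210_123456789abcdef0)))
```

If split_u128 returns two equal halves, the movddup broadcast load mentioned earlier is the shorter option.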
There are multiple ways of embedding constants in the instruction stream:
- by using immediate operands
- by loading from PC-relative addresses
So while there is no way to do an immediate load into an XMM register, it's possible to do a PC-relative load (in 64-bit mode) from a value stored "right next" to where the code executes. That creates something like:
.p2align 4 # align to 16 bytes, as movdqa requires
.val:
.long 0x12345678
.long 0x9abcdef0
.long 0xfedbca98
.long 0x76543210
func:
movdqa .val(%rip), %xmm0
When you disassemble:
0000000000000000 <.val>:
0: 78 56 34 12 f0 de bc 9a
8: 98 ca db fe 10 32 54 76
0000000000000010 <func>:
10: 66 0f 6f 05 e8 ff ff ff movdqa -0x18(%rip),%xmm0 # 0 <.val>
which is utterly compact, 24 bytes (16 bytes of data plus the 8-byte movdqa).
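The -0x18 in the disassembly follows from how RIP-relative addressing works: the displacement is measured from the address of the next instruction, not from the movdqa itself. The movdqa encoding here is 8 bytes long (66 prefix, 0f 6f opcode, 05 ModRM, 4-byte displacement), so a short Python check of the arithmetic, using a made-up helper name, looks like this:

```python
import struct

def rip_relative_disp32(instr_addr, instr_len, target_addr):
    """Displacement that is added to the address of the *next* instruction."""
    return target_addr - (instr_addr + instr_len)

# movdqa sits at 0x10 and is 8 bytes long; .val sits at 0x0.
disp = rip_relative_disp32(instr_addr=0x10, instr_len=8, target_addr=0x0)

# Little-endian signed 32-bit value, as stored at the end of the instruction.
encoded = struct.pack("<i", disp)

if __name__ == "__main__":
    print(hex(disp), encoded.hex())
```

The packed bytes are e8 ff ff ff, matching the tail of the instruction encoding.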
Other options are to construct the value on the stack and again load it from there. In 32-bit x86, where you don't have %rip-relative memory access, one can still do that in 25 bytes (assuming the stack pointer is 16-byte aligned on entry; otherwise an unaligned load is required):
00000000 :
0: 68 78 56 34 12 push $0x12345678
5: 68 f0 de bc 9a push $0x9abcdef0
a: 68 98 ca db fe push $0xfedbca98
f: 68 10 32 54 76 push $0x76543210
14: 66 0f 6f 04 24 movdqa (%esp),%xmm0
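One detail worth checking with the push-based variant is element order: the stack grows downward, so the last value pushed ends up at the lowest address. The Python sketch below (helper names are illustrative) models the memory image for both approaches. It suggests that the push sequence above loads the four dwords in the opposite order from the .val table, and that pushing them in reverse would give the same register contents.

```python
import struct

DWORDS = [0x12345678, 0x9abcdef0, 0xfedbca98, 0x76543210]

def table_image(dwords):
    """Bytes of consecutive little-endian `.long` values, lowest address first."""
    return b"".join(struct.pack("<I", d) for d in dwords)

def stack_image(pushed):
    """Bytes at the stack pointer after pushing 32-bit values in the given order."""
    memory = b""
    for d in pushed:
        # Each push lands just below everything pushed before it.
        memory = struct.pack("<I", d) + memory
    return memory

if __name__ == "__main__":
    print("table :", table_image(DWORDS).hex())
    print("pushes:", stack_image(DWORDS).hex())
    print("same when reversed:", stack_image(DWORDS[::-1]) == table_image(DWORDS))
```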
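In 64-bit mode that takes 27 bytes. Note that the x86-64 ABIs guarantee 16-byte stack alignment before the call instruction, so at function entry %rsp is 8 bytes off a 16-byte boundary and remains so after two 8-byte pushes; movdqa (%rsp) is therefore only safe once the stack has been realigned, otherwise use movdqu: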
0000000000000000 :
0: 48 b8 f0 de bc 9a 78 56 34 12 movabs $0x123456789abcdef0,%rax
a: 50 push %rax
b: 48 b8 10 32 54 76 98 ba dc fe movabs $0xfedcba9876543210,%rax
15: 50 push %rax
16: 66 0f 6f 04 24 movdqa (%rsp),%xmm0
If you compare any of these with the MOVLHPS version, you'll notice it's the longest:
0000000000000000 :
0: 48 b8 f0 de bc 9a 78 56 34 12 movabs $0x123456789abcdef0,%rax
a: 66 48 0f 6e c0 movq %rax,%xmm0
f: 48 b8 10 32 54 76 98 ba dc fe movabs $0xfedcba9876543210,%rax
19: 66 48 0f 6e c8 movq %rax,%xmm1
1e: 0f 16 c1 movlhps %xmm1,%xmm0
at 33 Bytes.
The other advantage of loading directly from instruction memory is that the movdqa doesn't depend on anything previous. Most likely, the first version, as given by @Paul R, is the fastest you can get.
The best solution (especially if you want to stick to SSE2 - i.e. to avoid using AVX) is to initialize two registers (say, xmm0 and xmm1) with the two 64-bit halves of your immediate value, and then do MOVLHPS xmm0,xmm1 to combine them.
To initialize each 64-bit half, the easiest solution is to load it into a general-purpose register (say, RAX), and then use MOVQ to transfer its value to the XMM register.
So the sequence would be something like this:
MOV RAX, <first_half>
MOVQ XMM0, RAX
MOV RAX, <second_half>
MOVQ XMM1, RAX
MOVLHPS XMM0,XMM1
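To see why this sequence yields the full 128-bit value: MOVQ from a general-purpose register writes the low 64 bits of the XMM register and zeroes the upper 64, and MOVLHPS copies the low 64 bits of its source into the high 64 bits of its destination. Below is a toy Python model of those two instructions, with registers as plain integers; the function names and sample halves are invented for this sketch.

```python
MASK64 = (1 << 64) - 1

def movq(gpr_value):
    """MOVQ xmm, r64: low 64 bits = register value, high 64 bits cleared."""
    return gpr_value & MASK64

def movlhps(dest, src):
    """MOVLHPS dest, src: dest[127:64] = src[63:0], dest[63:0] unchanged."""
    return ((src & MASK64) << 64) | (dest & MASK64)

xmm0 = movq(0x123456789abcdef0)  # <first_half>
xmm1 = movq(0xfedcba9876543210)  # <second_half>
xmm0 = movlhps(xmm0, xmm1)

if __name__ == "__main__":
    print(f"{xmm0:#034x}")
```

So <first_half> ends up in bits 63:0 of XMM0 and <second_half> in bits 127:64.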