Consider the following x86 assembly:
; something that sets rax
mov rcx, [rdi]
xor rax, rcx
xor rax, rcx
At the end of the sequence, rax
has the same value as it had on entry, but from the point of view of the CPU its value depends on the value loaded from memory into rcx
. In particular, subsequent use of rax
won't start until that load and the two xor
instructions complete.
Is there any way to achieve this effect more efficiently than the two-xor
sequence, e.g., with a single one-uop one-cycle-latency instruction? It's OK if some constant value needs to be set up once before the sequence (e.g., having a zeroed register).
With only 1 uop / 1c latency on the critical path of the target register:
# target=rax extra source=rcx
mov edx, ecx ; no latency
and edx, 0 ; BMI1 ANDN could mov+and in 1 uop, port 1 or 5 only on SnB-family (Ryzen: any)
or rax, rdx
AND with zero is not special-cased as a dep-breaking zeroing idiom on any CPUs, AFAIK.
front-end uops: 3 (or 2 with BMI1). Latency:
- from rcx to rax: 2c (with mov-elimination or BMI1).
- from rax(input) to rax(output): 1c
With a zeroed register, if it's ok to couple all the dep chains into that one register (unlike the ANDN version that only reads an all-ones register):
and edx, ecx # 0 &= ecx
or rax, rdx # rax |= 0
To test a function's latency (not throughput) but still feed it the same input repeatedly:
.loop:
call func ; arg in RDI, return in RAX
mov rdi, rbx ; arg for next iter, off the critical path
and eax, 0 ; 1c latency
or rdi, rax ; 1c latency
jmp .loop
If the function is pure, we can do 1c / 1uop
Actually it just has to return a known value for a given input. This also works if its impurities are limited to having other side-effects / outputs.
Instead of XOR twice after getting the result, set things up so we already have an XOR that we can unscramble with only one more XOR. Or use addition because LEA lets us copy-and-add in one instruction, saving a mov
that wouldn't be on the critical path.
mov rdi, rbx ; original input
call func
sub rbx, rax ; RBX = input - output
.loop:
call func
lea rdi, [rbx + rax] ; RDI = (input-output) + output = input
jmp .loop
@RossRidge's suggestion is only 1 uop on SnB-family CPUs, but only runs on port 1:
shld rax, rcx, 0
3c latency, 1 uop for port 1 on HSW/SKL. Agner Fog reports 1c latency for IvB, but 3c for HSW/BDW/SKL.
shld r,r,i
is 2 uops on older Intel, and significantly slower on AMD, like 6 uops / 3c latency on Piledriver / Ryzen.
Note that instlatx64 reports 1c latency / 0.5c throughput for shld/shrd on Haswell/Skylake (like single-register shifts), but I tested myself and it's definitely 3c latency / 1c throughput. Reported as an instlatx64 bug on their github page.
SHLD could also be interesting for copying a 32-bit register with a dependency on another. e.g. @BeeOnRope describes wanting to call a function repeatedly with the same input value in RDI, but with a dependency on the result in RAX. If we only care about EDI, then
; RBX = input<<32
call func
mov edi, eax ; 0 latency with mov-elimination
shld rdi, rbx, 32 ; EDI = the high 32 bits of RBX, high bits of RDI = old EDI.
Of course that's pointless vs. this, which doesn't need mov-elimination
call func
mov rdi, rbx ; off critical path
shld rdi, rax, 0 ; possibly 1c latency on SnB / IvB. 3 on HSW/SKL
A modification of @DavidWholford's suggestion works, too:
test ecx,ecx ; CF=0, with a false dependency on RCX
adc rax, 0 ; dependent on CF
2 uops on Haswell/Broadwell/Skylake and AMD. 3 uops on Intel P6-family, and maybe SnB/IvB. Latency:
- from rcx to rax: 2c on HSW and later, 3 with 2-uop adc
- from rax to rax: 1c on HSW and later, 2 with 2-uop adc
ADC on Haswell and earlier is normally 2 uops, but adc
with immediate 0 is special-cased on Haswell to only be 1 uop / 1c. adc eax,0
is always 2c latency on Core 2. The first uarch with this optimization might be SnB, but hopefully we get an answer on Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?
test
clears CF regardless of the value, but I think (untested) that CF still carries a dependency on the source register. If not, then perhaps use TEST / ADOX could be useful on Broadwell and later. (Because CF is renamed separately on most CPUs, but OF might only be part of the same bundle as ZF / SF and other flags that do depend on the AND result.)