All the following instructions do the same thing: set %eax
to zero. Which way is optimal (requiring fewest machine cycles)?
xorl %eax, %eax
mov $0, %eax
andl $0, %eax
TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32. xor r64, r64 is a waste of a byte, because it needs a REX prefix.

Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10.

Zeroing a vector register is usually best done with pxor xmm, xmm. That's typically what gcc does (even before use with FP instructions).

xorps xmm, xmm can make sense. It's one byte shorter than pxor, but xorps needs execution port 5 on Intel Nehalem, while pxor can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).

On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).

Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm is a good choice for zeroing YMM (AVX1/AVX2) or ZMM (AVX512), or any future vector extension. vpxor ymm, ymm, ymm doesn't take any extra bytes to encode, though, and runs the same. The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.

Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor
so you don't have to worry about which CPU recognizes which zeroing idiom.

xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary, then I'll expand on those): smaller machine-code size than mov reg,0 (all CPUs); no execution unit needed (Intel SnB-family); and no partial-register penalties for later code (Intel P6 and SnB families).

Smaller machine-code size (2 bytes instead of 5) is always an advantage: higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.
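To make the size comparison concrete, here is a small listing of the zeroing candidates mentioned so far, with the byte encodings I'd expect an assembler to emit shown in the comments. Treat it as a sketch: the register choices (eax, r10, xmm0) are arbitrary, and an assembler is free to pick an equivalent encoding of the same length, so check the output with a disassembler.

```asm
; NASM syntax, 64-bit mode. Encodings in comments are expected output,
; not guaranteed: confirm with a disassembler.
bits 64
section .text

    xor   eax, eax          ; 31 C0            2 bytes, also zeros the upper half of RAX
    xor   rax, rax          ; 48 31 C0         3 bytes, REX.W prefix buys nothing
    mov   eax, 0            ; B8 00 00 00 00   5 bytes
    and   eax, 0            ; 83 E0 00         3 bytes, still depends on the old EAX

    xor   r10d, r10d        ; 45 31 D2         3 bytes, REX needed for r10 anyway
    xor   r10, r10          ; 4D 31 D2         3 bytes, but not dep-breaking on Silvermont

    xorps xmm0, xmm0        ; 0F 57 C0         3 bytes
    pxor  xmm0, xmm0        ; 66 0F EF C0      4 bytes
    vpxor xmm0, xmm0, xmm0  ; C5 F9 EF C0      4 bytes, zeros the full YMM0/ZMM0
    vpxor ymm0, ymm0, ymm0  ; C5 FD EF C0      4 bytes, same length as the XMM form
```

Assembling this (for example with nasm -f elf64) and running objdump -d on the object file is a quick way to compare against what your own toolchain produces.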
The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't happen in practice), HSW could still sustain 4 uops per clock even when they all need execution ports.

See my answer on another question about zeroing registers for some more details.
Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).

On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.

Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).
xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8 bits (AH) are modified and then the whole register is read, and Haswell even removes that.

From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):
pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.

xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8-bit destination, you usually need to take care to avoid partial-register penalties.

It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the 3-bit source-register (reg) field of the ModRM byte (the way some other single-operand instructions use it as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.

Ideally, you should use xor / set flags / setcc / read full register. This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).
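As an illustration of that ordering, here is a minimal sketch of a function that returns 0 or 1 depending on whether its argument is non-zero. The function name is_nonzero and the x86-64 System V register assignment (argument in EDI, result in EAX) are my assumptions for the example; any register pair works the same way.

```asm
; NASM syntax. Hypothetical helper: unsigned is_nonzero(unsigned x)
; Assumes the x86-64 System V convention: x arrives in EDI, result goes in EAX.
bits 64
section .text
global is_nonzero

is_nonzero:
    xor   eax, eax      ; zero the full register first (this clobbers flags)
    test  edi, edi      ; now set flags from the input
    setnz al            ; write only the low 8 bits: AL = (x != 0)
    ret                 ; caller reads all of EAX, which is 0 or 1
```

The important detail is that the xor comes before the flag-setting instruction, since xor itself overwrites the flags that setnz needs.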
Things are more complicated when you don't want to xor before a flag-setting instruction, e.g. you want to branch on one condition and then setcc on another condition from the same flags (cmp/jle, sete), and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.

There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.

Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test / setcc / movzx worse than xor / test / setcc, but still at least as good as test / mov r,0 / setcc (and much better on older CPUs).

Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0 / setcc for zeroing / dependency-breaking is probably the best alternative when xor / test / setcc isn't an option.

Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)

and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor, and many disadvantages.

See http://agner.org/optimize/ for microarch documentation, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.)

mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)

Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both. (See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and @Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain.)

If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, though.
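For the case discussed earlier, where the flags are already set and an xor in between would clobber them, the two fallback sequences look roughly like this. Both functions are hypothetical (the names, and the choice to return -1 on the branch path, are made up for the example) and assume the x86-64 System V convention with a in EDI, b in ESI, and the result in EAX.

```asm
; NASM syntax. Both helpers return -1 if a < b (signed),
; otherwise (a == b) as 0 or 1, reusing the flags from a single cmp.
bits 64
section .text
global sete_movzx, mov0_sete

sete_movzx:                 ; the option suggested for Intel P6 / SnB families
    cmp   edi, esi
    jl    .less
    sete  al                ; reuses the flags from cmp, writes only AL
    movzx eax, al           ; zero-extend afterwards instead of xor-zeroing first
    ret
.less:
    mov   eax, -1
    ret

mov0_sete:                  ; the option suggested for AMD / P4 / Silvermont
    cmp   edi, esi
    jl    .less
    mov   eax, 0            ; mov leaves flags untouched and breaks the dep on old EAX
    sete  al                ; still sees the flags from cmp
    ret
.less:
    mov   eax, -1
    ret
```

Which one is cheaper depends on the microarchitecture, as described above; if a spare register is available and the xor can go ahead of the cmp, that is still the better choice.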