Set all bits in CPU register to 1 efficiently

Posted 2019-01-09 12:23

Question:

To clear all bits you often see an exclusive or as in XOR eax, eax. Is there such a trick for the opposite too?

All I can think of is to invert the zeroes with an extra instruction.

Answer 1:

For most architectures with fixed-width instructions, the answer will probably be a boring one-instruction mov of a sign-extended or inverted immediate, or a mov lo/high pair. e.g. on ARM, mvn r0, #0 (move-not). See gcc asm output for x86, ARM, ARM64, and MIPS, on the Godbolt compiler explorer. IDK anything about zseries asm or machine code.
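
For concreteness, here is a minimal sketch of those boring single-instruction forms, one line per ISA, each shown in that ISA's usual mnemonics (register choices are arbitrary examples, and the uniform ;-comments are just for this illustration):

    x86-64 (NASM syntax):   mov  eax, -1     ; mov r32, imm32 with a sign-extended constant
    ARM A32:                mvn  r0, #0      ; move-not: bitwise NOT of zero = all-ones
    AArch64:                mov  x0, #-1     ; alias that assembles to movn x0, #0 (move wide with NOT)
    MIPS:                   li   $t0, -1     ; pseudo-instruction; expands to addiu $t0, $zero, -1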

In ARM, eor r0,r0,r0 is significantly worse than a mov-immediate. It depends on the old value, with no special-case handling. Memory dependency-ordering rules prevent an ARM uarch from special-casing it even if they wanted to. The same goes for most other RISC ISAs that have weakly-ordered memory but don't require barriers for memory_order_consume (in C++11 terminology).


x86 xor-zeroing is special because of its variable-length instruction set. Historically, 8086 xor ax,ax was fast directly because it was small. Since the idiom became widely used (and zeroing is much more common than all-ones), CPU designers gave it special support, and now xor eax,eax is faster than mov eax,0 on Intel Sandybridge-family and some other CPUs, even without considering direct and indirect code-size effects. See What is the best way to set a register to zero in x86 assembly: xor, mov or and? for as many micro-architectural benefits as I've been able to dig up.

If x86 had a fixed-width instruction-set, I wonder if mov reg, 0 would have gotten as much special treatment as xor-zeroing has? Perhaps, because dependency-breaking before writing the low8 or low16 is important.
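
As a small illustration of that last point, a sketch in NASM syntax (the [rdi] byte source is just a placeholder):

    xor   eax, eax          ; dependency-breaking zeroing idiom
    mov   al, byte [rdi]    ; write only the low 8 bits
    ; RAX now holds that byte zero-extended, with no false dependency on
    ; (or partial-register merge with) whatever RAX held before.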


The standard options for best performance:

  • mov eax, -1: 5 bytes, using the mov r32, imm32 encoding. (There is no sign-extending mov r32, imm8, unfortunately). Excellent performance on all CPUs. 6 bytes for r8-r15 (REX prefix).
  • mov rax, -1: 7 bytes, using the mov r/m64, sign-extended-imm32 encoding. (Not the REX.W=1 version of the eax version. That would be 10-byte mov r64, imm64). Excellent performance on all CPUs.
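
As a sketch (NASM syntax; with default optimization the assembler picks these compact encodings, otherwise mov rax, -1 can end up as the 10-byte imm64 form):

    mov   eax, -1           ; 5 bytes: B8 FF FF FF FF        (writing EAX also zeroes the upper 32 bits of RAX)
    mov   rax, -1           ; 7 bytes: 48 C7 C0 FF FF FF FF  (sign-extended imm32)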

The weird options that save some code-size, usually at the expense of performance (a combined sketch follows the list):

  • xor eax,eax/dec rax (or not rax): 5 bytes (4 for 32-bit eax). Downside: two uops for the front-end. Still only one unfused-domain uop for the scheduler/execution units on recent Intel where xor-zeroing is handled in the front-end. mov-immediate always needs an execution unit. (But integer ALU throughput is rarely a bottleneck for instructions that can use any port; the extra front-end pressure is the problem)
  • xor ecx,ecx / lea eax, [rcx-1]: 5 bytes total for 2 constants (6 bytes for rax). Leaves a separate zeroed register. If you already want a zeroed register, there is almost no downside to this. lea can run on fewer ports than mov r,i on most CPUs, but since this is the start of a new dependency chain, the CPU can run it in any spare execution-port cycle after it issues.

    The same trick works for any two nearby constants, if you do the first one with mov reg, imm32 and the second with lea r32, [base + disp8]. disp8 has a range of -128 to +127, otherwise you need a disp32.

  • or eax, -1: 3 bytes (4 for rax), using the or r/m32, sign-extended-imm8 encoding. Downside: false dependency on the old value of the register.

  • push -1 / pop rax: 3 bytes. Slow but small. Recommended only for exploits / code-golf. Works for any sign-extended-imm8, unlike most of the others.

    Downsides:

    • uses store and load execution units, not ALU. (Possibly a throughput advantage in rare cases on AMD Bulldozer-family where there are only two integer execution pipes, but decode/issue/retire throughput is higher than that. But don't try it without testing.)
    • store/reload latency means rax won't be ready for ~5 cycles after this executes on Skylake, for example.
    • (Intel): puts the stack-engine into rsp-modified mode, so the next time you read rsp directly it will take a stack-sync uop. (e.g. for add rsp, 28, or for mov eax, [rsp+8]).
    • The store could miss in cache, triggering extra memory traffic. (Possible if you haven't touched the stack inside a long loop).
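
A combined sketch of these code-size idioms (NASM syntax), with byte counts as comments; which one makes sense depends on whether you can afford the extra uop, the false dependency, or the store/reload:

    ; xor-zero + decrement: two front-end uops, but the xor part is dep-breaking
    xor   eax, eax          ; 2 bytes
    dec   rax               ; 3 bytes; RAX = -1

    ; all-ones plus a separately usable zeroed register, 5 bytes total
    xor   ecx, ecx          ; 2 bytes; ECX = 0
    lea   eax, [rcx - 1]    ; 3 bytes; EAX = -1, starts a fresh dependency chain

    ; the same trick for any two nearby constants
    mov   ecx, 1000         ; first constant: mov r32, imm32
    lea   eax, [rcx + 7]    ; second constant from a disp8 (range -128..+127)

    ; smallest ALU form, but a false dependency on the register's old value
    or    eax, -1           ; 3 bytes (or r/m32, sign-extended imm8)

    ; smallest overall, but round-trips through the store/load units
    push  -1                ; 2 bytes (push sign-extended imm8)
    pop   rax               ; 1 byte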

Vector regs are different

Setting vector registers to all-ones with pcmpeqd xmm0,xmm0 is special-cased on most CPUs as dependency-breaking (not Silvermont/KNL), but still needs an execution unit to actually write the ones. pcmpeqb/w/d/q all work, but q is slower on some CPUs.

The AVX/AVX2 version of this is also the best choice there; see Fastest way to set __m256 value to all ONE bits.
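
As a sketch (NASM syntax) of the compare-against-itself idiom, assuming xmm0/ymm0 are the registers you want all-ones in:

    pcmpeqd   xmm0, xmm0            ; SSE2: 128-bit all-ones; treated as independent of xmm0's old value on most CPUs
    vpcmpeqd  ymm0, ymm0, ymm0      ; AVX2: 256-bit all-ones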


AVX512 compares are only available with a mask register (like k0) as the destination, so compilers are currently using vpternlogd zmm0,zmm0,zmm0, 0xff as the 512b all-ones idiom. (0xff makes every element of the 3-input truth-table a 1). This is not special-cased as dependency-breaking on KNL or SKL, but it has 2-per-clock throughput on Skylake-AVX512. This beats using a narrower dependency-breaking AVX all-ones and broadcasting or shuffling it.
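
A sketch of that 512-bit idiom (the 0xff immediate is the 3-input truth table, so every result bit is 1 regardless of the inputs):

    vpternlogd  zmm0, zmm0, zmm0, 0xff   ; zmm0 = all-ones; not dep-breaking, but 2-per-clock on SKX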

If you need to re-generate all-ones inside a loop, obviously the most efficient way is to use a vmov* to copy an all-ones register. This doesn't even use an execution unit on modern CPUs (but still takes front-end issue bandwidth). But if you're out of vector registers, loading a constant or [v]pcmpeq[b/w/d] are good choices.
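
For example, hoisting the all-ones generation out of a loop might look like this (hypothetical loop structure, NASM syntax):

    vpcmpeqd  ymm15, ymm15, ymm15   ; make all-ones once, before the loop
.loop:
    vmovdqa   ymm0, ymm15           ; copy it; eliminated at register rename on modern CPUs, no ALU uop
    ; ... loop body that destroys ymm0 ...
    dec       ecx
    jnz       .loop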

For AVX512, it's worth trying VPMOVM2D zmm0, k0 or maybe VPBROADCASTD zmm0, eax. Each has only 1c throughput, but they should break dependencies on the old value of zmm0 (unlike vpternlogd). They require a mask or integer register which you initialized outside the loop with kxnorw k1,k0,k0 or mov eax, -1.
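
A sketch of that pattern (hypothetical loop structure, NASM syntax; vpmovm2d requires AVX512DQ):

    kxnorw    k1, k0, k0            ; k1 = all-ones mask bits, set up once outside the loop
.loop:
    vpmovm2d  zmm0, k1              ; zmm0 = all-ones, no dependency on zmm0's old value
    ; ... loop body that destroys zmm0 ...
    dec       eax
    jnz       .loop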


For AVX512 mask registers, kxnorw k1,k0,k0 works, but it's not dependency-breaking on current CPUs. Intel's optimization manual suggests using it for generating an all-ones before a gather instruction, but recommends avoiding using the same input register as the output. This avoids making an otherwise-independent gather dependent on a previous one in a loop. Since k0 is often unused, it's usually a good choice to read from.

I think vpcmpeqd k1, zmm0,zmm0 would work, but it's probably not special-cased as a set-all-mask-bits idiom with no dependency on zmm0. (To set all 64 bits instead of just the low 16, use AVX512BW vpcmpeqb)
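
A sketch of the mask-register options (NASM syntax; k0 is read as a conveniently never-written source, per the advice above):

    kxnorw    k1, k0, k0            ; k1 = all-ones in the low 16 mask bits
    vpcmpeqd  k1, zmm0, zmm0        ; alternative: 16 set bits, but carries a dependency on zmm0
    vpcmpeqb  k1, zmm0, zmm0        ; AVX512BW: sets all 64 mask bits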

On Skylake-AVX512, k instructions that operate on mask registers only run on a single port, even simple ones like kandw. (Also note that Skylake-AVX512 won't run vector uops on port1 when there are any 512b operations in the pipe, so execution unit throughput can be a real bottleneck.)

There is no kmov k0, imm, only moves from integer or memory. Probably there are no k instructions where same,same is detected as special, so the hardware in the issue/rename stage doesn't look for it for k registers.