I have the following x86 assembly code:
movl 8(%ebp), %edx //get an argument from the caller
movl $0, %eax
testl %edx, %edx
je .L1
.L2: // what's the purpose of this loop body?
xorl %edx, %eax
shrl $1, %edx
jne .L2
.L1:
andl $1, %eax
The textbook gives the corresponding C code as follows:
int f1(unsigned x)
{
    int y = 0;
    while (x != 0) {
        __________;
    }
    return __________;
}
The book asks readers to fill in the blanks and answer the question "What does it do?"
I can't combine the loop body into one C expression. I can tell what the loop body does, but I have no idea what its purpose is. The textbook also says that %eax holds the return value here. So what is the purpose of andl $1, %eax? I have no idea about that either.
It looks like the purpose of the whole loop is to XOR together all the bits of the 32-bit argument, i.e. to calculate its parity.
Working backwards from the last instruction (and $1,%eax), we know that only the low bit of the result matters. With that in mind, the xor %edx,%eax becomes clearer: it XORs the current low bit of %edx into %eax. The high garbage doesn't matter.
The shr loops until all of x's bits have been shifted out. We could always loop 32 times to get all the bits, but that would be less efficient than stopping once x is 0. (Because of how XOR works, we don't need to actually XOR in the 0 bits; that has no effect.)
Once we know what the function does, filling in the C becomes an exercise in clever / compact C syntax. I thought at first that y ^= (x>>=1); would fit inside the loop, but that shifts x before using it the first time.
The only way I see to do it in one C statement is with the comma operator (which does introduce a sequence point, so it's safe to read x on the left side of the comma and modify it on the right side). So, y ^= x, x>>=1; fits.
Or, for more readable code, just cheat and put two statements on the same line, separated by a semicolon.
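For reference, this is how the comma-operator version slots into the textbook's skeleton. Treat it as a sketch of one possible fill-in: the return expression y & 1 is my reading of the final andl $1, %eax, not something quoted from the book.

```c
/* Textbook skeleton with both blanks filled in (one possible answer). */
int f1(unsigned x)
{
    int y = 0;
    while (x != 0) {
        y ^= x, x >>= 1;   /* xorl %edx,%eax then shrl $1,%edx */
    }
    return y & 1;          /* andl $1,%eax: keep only the low bit */
}
```

For example, f1(7) returns 1 (three set bits, odd parity) and f1(3) returns 0 (two set bits, even parity). Note that storing the unsigned XOR result back into the int y relies on implementation-defined conversion when the top bit is set; gcc defines it as wrapping, which is what the asm does anyway.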
This compiles to essentially the same asm as shown in the question, using gcc 5.3 -O3 on the Godbolt compiler explorer. The question's code de-optimizes the xor-zeroing idiom to a mov $0, %eax, and optimizes away gcc's silly duplication of ret instructions. (Or maybe it came from an earlier version of gcc that didn't do that.)
The loop is very inefficient. This is a more efficient way:
We don't need a loop with O(n) complexity (where n is the width in bits of x). Instead, we can get O(log2(n)) complexity, and actually take advantage of x86 tricks to only do the first 2 steps of that.
I've left off the operand-size suffix for instructions where it's determined by the registers. (Except for xorw, to make the 16-bit xor explicit.)
Yes, that's right, x86 has a parity flag (PF) that's updated from the low 8 bits of the result of every instruction that "sets flags according to the result", like xor.
We use the np condition because PF = 1 means even parity: xor of all bits = 0. We need the inverse to return 0 for even parity.
To take advantage of it, we do a SIMD-style horizontal reduction by bringing the high half down to the low half and combining, repeating twice to reduce 32 bits to 8 bits.
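The same halving reduction can be sketched in portable C. This is only an illustration of the O(log2(n)) idea, not the asm discussed above: C has no access to PF, so the sketch keeps folding all the way down to one bit instead of stopping at 8. The name parity_fold is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: folds 32 bits down to 1 in log2(32) = 5 steps. */
static inline unsigned parity_fold(uint32_t x)
{
    x ^= x >> 16;   /* combine the high and low 16-bit halves */
    x ^= x >> 8;    /* combine the two bytes of the low half */
    /* An x86 asm version could stop here and read PF; C keeps folding. */
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;   /* 1 means an odd number of set bits */
}
```

After each step only the low half of the previous width matters, so the upper bits can be left as garbage until the final mask.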
Zeroing eax (with an xor) before the instruction that sets flags is slightly more efficient than doing set-flags / setp %al / movzbl %al, %eax, as I explained in "What is the best way to set a register to zero in x86 assembly: xor, mov or and?".
Or, as @EOF points out, if the CPUID POPCNT feature bit is set, you can use popcnt and test the low bit to see if the number of set bits is even or odd. (Another way to look at this: xor is add-without-carry, so the low bit is the same whether you xor all the bits together or add all the bits together horizontally.)
GNU C also has __builtin_parity and __builtin_popcount, which use the hardware instruction if you tell the compiler that the compile target supports it (with -march=... or -mpopcnt), but otherwise compile to an efficient sequence for the target machine. The Intel intrinsics always compile to the machine instruction, not a fallback sequence, and it's a compile-time error to use them without the appropriate -mpopcnt target option.
Unfortunately gcc doesn't recognize the pure-C loop as being a parity calculation and optimize it into this. Some compilers (like clang and probably gcc) can recognize some kinds of popcount idioms, and optimize them into the popcnt instruction, but that kind of pattern recognition doesn't happen in this case. :(
See these on Godbolt.
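As a small illustration of the GNU C builtins mentioned above, here is a sketch of both routes to the parity. It assumes gcc or clang, since the builtins are compiler extensions; the wrapper names are made up for this example.

```c
/* Requires GNU C (gcc or clang): both builtins are compiler extensions. */

/* Hypothetical wrapper: parity directly from the builtin. */
static inline int parity_builtin(unsigned x)
{
    return __builtin_parity(x);        /* number of 1-bits modulo 2 */
}

/* Hypothetical wrapper: parity as the low bit of the population count. */
static inline int parity_popcount(unsigned x)
{
    return __builtin_popcount(x) & 1;  /* add-without-carry view of xor */
}
```

Both should agree with the loop version for every input; whether they compile to popcnt, a PF-based sequence, or a fallback depends on the target options you pass.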
See also other links in the x86 tag wiki.