Compiling the following code in MSVC 2013 (64-bit release build, /O2 optimization):
while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
++s;
}
I get the following code, which has a really cool optimization using a 64-bit register as a lookup table with the bt (bit test) instruction:
mov rcx, 17596481020928 ; 0000100100002400H
npad 5
$LL82@myFunc:
movzx eax, BYTE PTR [rsi]
cmp al, 44 ; 0000002cH
ja SHORT $LN81@myFunc
movsx rax, al
bt rcx, rax
jae SHORT $LN81@myFunc
inc rsi
jmp SHORT $LL82@myFunc
$LN81@myFunc:
; code after loop...
But my question is: what is the purpose of the movsx rax, al after the first branch?
First we load a byte from the string into rax and zero-extend it:

movzx eax, BYTE PTR [rsi]

Then the cmp/ja pair performs an unsigned comparison between al and 44, and branches forwards if al is greater.

So now we know 0 <= al <= 44 in unsigned numbers. Therefore, the highest bit of al could not possibly be set!
Nonetheless, the next instruction is movsx rax, al. This is a sign-extending move. But since:

- al is the lowest byte of rax
- we already know the other 7 bytes of rax are zeroed
- we just proved that al's highest bit could not possibly be set

this movsx must be a no-op.
Why does MSVC do it? I'm assuming it's not for padding, since in that case another npad would make the meaning clearer. Is it flushing data dependencies or something?
(By the way, this bt optimization really makes me happy. Some interesting facts: it runs in 0.6x the time of the four cmp/je pairs you might expect; it's way faster than strspn or std::string::find_first_not_of; and it only happens in 64-bit builds, even if the characters of interest have values under 32.)
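For reference, the same trick written out in portable C++ looks like this. It's my own sketch of the technique (skip_separators is a made-up name), not anything the compiler emits:

#include <cstdint>

// One 64-bit constant is the whole lookup table, since all four character
// values are below 64.
const std::uint64_t kMask =
    (1ULL << ' ') | (1ULL << ',') | (1ULL << '\r') | (1ULL << '\n');
static_assert(kMask == 0x0000100100002400ULL,
              "same constant the compiler loads into rcx");

const char* skip_separators(const char* s) {
    // Mirrors the generated code: range check (cmp/ja), then bit test (bt).
    while (static_cast<unsigned char>(*s) <= 44 &&
           ((kMask >> static_cast<unsigned char>(*s)) & 1) != 0) {
        ++s;
    }
    return s;
}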
You surely recognize that this optimization was produced by very specific code in the optimizer that looked for the pattern. Just the generation of the bit-mask gives it away. Yes, nice trick.
There are two basic codegen cases here. The first one is the more universal one, where charmax - charmin <= 64 but charmax >= 64. The optimizer needs to generate different code from what you saw: it needs to subtract charmin. That version does not have the MOVSX instruction. You can see it by replacing *s == ' ' with *s == 'A'.
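Under my reading, that general case has to look roughly like this in C++ terms (the function name and the character set 'A'/'C'/'Z' are made up for illustration):

#include <cstdint>

// Hypothetical general case: charmin = 'A' (65), charmax = 'Z' (90), so the
// index has to be rebased by charmin before the bit test.
bool in_set_general(unsigned char c) {
    const std::uint64_t mask = (1ULL << ('A' - 'A')) |
                               (1ULL << ('C' - 'A')) |
                               (1ULL << ('Z' - 'A'));
    unsigned int biased = c - 'A';   // the SUB that the special case avoids;
                                     // wraps for c < 'A', failing the check
    return biased <= 'Z' - 'A' && ((mask >> biased) & 1) != 0;
}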
Then there's the special case that you tested, where all the character codes to test happen to be < 64. The Microsoft programmer did deal with this in his code: he made sure not to generate a silly SUB EAX,0 instruction. But he overlooked that the MOVSX wasn't necessary either, surely because he only checked for optimal code in the general case. And it's a general function call in the code, so easy to overlook; note how the instruction changes to MOVZX when you compile with /J. Otherwise it is easily deemed necessary: there is no BT instruction that takes an 8-bit register as the 2nd operand, so the AL register load isn't enough by itself.
There could be a hypothetical post-optimizer optimizer that optimizes the optimized code generated by the optimizer and decides to keep the MOVSX to improve superscalar execution. I seriously doubt it exists.
As Hans Passant already mentioned, your code is a special case of the optimization. I did not look into the compiler/optimizer sources, so I can only say what I expect to happen. However, here is my explanation.
Your code in C / C++:
while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
++s;
}
is equivalent to:
while (*s == 32 || *s == 44 || *s == 13 || *s == 10) {
    ++s;
}
or:
while (*s == 10 || *s == 13 || *s == 32 || *s == 44) {
    ++s;
}
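(A quick compile-time check of those character codes, added for reference:)

static_assert(' ' == 32 && ',' == 44 && '\r' == 13 && '\n' == 10,
              "ASCII codes assumed throughout this answer");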
The optimizer detects that there is an "if" with multiple (>= 3) equality comparisons of the same character. Now the optimizer generates as many 64-bit patterns as needed (up to 4 patterns for an 8-bit char; 64*4 => all 256 possible values). Each of these bit patterns has an offset (= the test value to which the first bit in the pattern corresponds) that needs to be subtracted from the value under test.

In your case only one 64-bit pattern is needed, for the values ranging from 10 to 44. So your code gets converted into some generic code like this:
#define ranged_bittest(pattern, val, min_val, max_val) \
    ((val) >= (min_val) && (val) <= (max_val) && BT_with_offset((pattern), (val), (min_val)))

while (ranged_bittest(<BITPATTERN>, *s, 10, 44)) {
    ++s;
}
Now some other optimization takes ranged_bittest(<BITPATTERN>, *s, 10, 44), as it detects a bit test with start offset 10 and a pattern that can safely be shifted left by 10 bits (the upper 10 bits of the pattern are zero):

ranged_bittest(<BITPATTERN>, *s, 10, 44)
=> ranged_bittest(<BITPATTERN> << 10, *s, 0, 44)
This evolves to:
while (*s >= 0 && *s <= 44 && BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) {
    ++s;
}
The *s >= 0 && part gets optimized away (it is folded into the unsigned comparison against 44), and the "while" is translated to some "do" loop with breaks (which is better for branch prediction in most cases), resulting in:
do {
    if (*s > 44) break;
    if (!BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;
    ++s;
} while (true);
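(As a side check, my own addition: the shift step really does produce the constant from the question.)

#include <cstdint>

// Pattern for {10, 13, 32, 44} rebased to min_val = 10, then shifted left
// by 10 bits: exactly the value MSVC loads into rcx.
static_assert((((1ULL << (10 - 10)) | (1ULL << (13 - 10)) |
                (1ULL << (32 - 10)) | (1ULL << (44 - 10))) << 10)
              == 0x0000100100002400ULL,
              "shifted pattern matches 17596481020928");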
For whatever reason the optimization stops here (e.g. because there is a limit on how often further optimizations are applied to a code fragment, for performance reasons).
Now, in assembly, the line

if (!BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;

is translated to something like:
MOV <reg64_A>, const_pattern
MOV <reg64_B>, value
SUB <reg64_B>, const_offset
BT <reg64_A>, <reg64_B>
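(For concreteness, BT_with_offset can be modeled in C++ roughly like this; the helper is this answer's own invention, not an MSVC facility:)

#include <cstdint>

// Models BT on a 64-bit register: the returned bit is what lands in CF.
// The & 63 reflects BT's mod-64 index behavior for register destinations.
bool BT_with_offset(std::uint64_t pattern, unsigned value, unsigned min_val) {
    return ((pattern >> ((value - min_val) & 63)) & 1) != 0;
}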
The assembly optimizer reduces:

MOV <reg64_B>, value
SUB <reg64_B>, 0

to just:

MOVSX <reg64_B>, value

In your special case this is:

MOVSX rax, al
The assembly optimizer (somehow) does not have enough information to completely eliminate the sign extension. Maybe the assembly optimizer is quite "dumb" (it can't do any logical expression optimizations and such, just simple replacements), or the full optimization level isn't activated yet. It therefore can't remove the movsx rax, al, as it doesn't have any meta information about the possible values of rax or al at this point in the code.
I don't KNOW if this is true, but this is my best guess for this case...
What struck me most when I first saw this code is how poorly it is in fact optimized. Yes, it's a neat trick to use a 64-bit register for a lookup table, but ...
- Why still use INC as opposed to the better ADD, 1?
- Why CMP ... / JA ... / MOVSX ... when we know that the BT instruction uses its source operand mod 64 if its destination operand is a register, so there are 58 bits that need not be considered? (See the sketch after this list.)
- Why not benefit from the fact that backward conditional branches are predicted taken?
- Why not reduce the number of branches?
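(On the second point: the mod-64 reduction is also why some range check has to survive in one form or another. A quick illustration of the aliasing, my own addition:)

#include <cstdint>
#include <cassert>

int main() {
    const std::uint64_t mask = 0x0000100100002400ULL;  // the question's table
    // BT with a register destination takes the index mod 64, so 74 (= 64+10)
    // aliases onto bit 10 and would falsely match '\n' without a range check.
    assert(((mask >> (74 & 63)) & 1) == 1);
    return 0;
}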
I think a true assembler programmer would have written it more like this:
mov rcx, 0FFFFEFFEFFFFDBFFh ;~0000100100002400h
sub rsi, 1
npad 1
$LL82@myFunc:
add rsi, 1
movzx eax, BYTE PTR [rsi] ;mov al,[rsi]
test al, 11000000b
setz bl
test bl, bl
bt rcx, rax
ja SHORT $LL82@myFunc
ja jumps if (CF or ZF) = 0:

- ZF=1 for AL=[64,255], and we don't care about CF
- ZF=0 for AL=[0,63], and CF=0 for AL={10,13,32,44}
For all the ASCIIs in the range [64,255], the test al, 11000000b instruction gives a non-zero result (ZF=0). Because the combo setz bl / test bl, bl is then used to flip the ZF to 1, there's no longer any chance for the ja instruction to continue the loop.

Conversely, for all the ASCIIs in the range [0,63], the ZF ends up being 0, thus allowing ja to fully interpret the CF produced by the bt rcx, rax instruction.
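That flag logic can be sanity-checked by emulating it. The following is my own addition; it models the flags as booleans rather than executing the real instructions:

#include <cstdint>
#include <cassert>

int main() {
    const std::uint64_t inv = 0xFFFFEFFEFFFFDBFFULL;  // ~0x0000100100002400
    for (int al = 0; al < 256; ++al) {
        bool zf = (al & 0xC0) != 0;               // ZF after the setz/test flip
        bool cf = ((inv >> (al & 63)) & 1) != 0;  // CF from bt; index mod 64
        bool loops = !zf && !cf;                  // ja taken: (CF or ZF) == 0
        bool in_set = (al == 10 || al == 13 || al == 32 || al == 44);
        assert(loops == in_set);
    }
    return 0;
}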
Perhaps we are expecting too much from our optimizing compilers?