Why does MSVC emit a useless MOVSX before performi

2019-01-17 22:19发布

问题:

Compiling the following code in MSVC 2013, 64-bit release build, /O2 optimization:

while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}

I get the following code - which has a really cool optimization using a 64-bit register as a lookup table with the bt (bit test) instruction.

    mov     rcx, 17596481020928             ; 0000100100002400H
    npad    5
$LL82@myFunc:
    movzx   eax, BYTE PTR [rsi]
    cmp     al, 44                          ; 0000002cH
    ja      SHORT $LN81@myFunc
    movsx   rax, al
    bt      rcx, rax
    jae     SHORT $LN81@myFunc
    inc     rsi
    jmp     SHORT $LL82@myFunc
$LN81@myFunc:
    ; code after loop...

But my question is: what is the purpose of the movsx rax, al after the first branch?

First we load a byte from the string into rax and zero-extend it:

movzx eax, BYTE PTR [rsi]

Then the cmp/ja pair performs an unsigned comparison between al and 44, and branches forwards if al is greater.

So now, we know 0 <= al <= 44 in unsigned numbers. Therefore, the highest bit of al could not possibly be set!

Nonetheless, the next instruction is movsx rax, al. This is a sign-extended move. But since:

  • al is the lowest byte of rax
  • we already know the other 7 bytes of rax are zeroed
  • we just proved that al's highest bit could not possibly be set

this movsx must be a no-op.

Why does MSVC do it? I'm assuming it's not for padding, since in that case another npad would make the meaning clearer. Is it flushing data dependencies or something?

(By the way, this bt optimization really makes me happy. Some interesting facts: it runs in 0.6x the time of the 4 cmp/je pairs you might expect, it's way faster than strspn or std::string::find_first_not_of, and it only happens in 64-bit builds even if the characters of interest have values under 32.)

回答1:

You surely recognize that this optimization was produced by very specific code in the optimizer that looked for the pattern. Just the generation of the bit-mask gives it away. Yes, nice trick.

There are two basic codegen cases here. First one is the more universal one, where (charmax - charmin <= 64) but charmax >= 64. The optimizer needs to generate different code from what you saw, it needs to subtract charmin. That version does not have the MOVSX instruction. You can see it by replacing *s == ' ' by *s == 'A'.

Then there's the special case that you tested, all character codes to test happen to be < 64. The Microsoft programmer did deal with this in his code, he made sure not to generate a silly SUB EAX,0 instruction. But overlooked that generating the MOVSX wasn't necessary. Surely missed by only checking for optimal code in the general case. And a general function call in the code, so easy to overlook, note how the instruction changes to MOVZX when you compile with /J. Otherwise easily deemed necessary, there is no BT instruction that takes an 8-bit register as the 2nd operand so the AL register load isn't enough by itself.

There could be a hypothetical post-optimizer optimizer that optimizes the optimized code generated by the optimizer. And decided to keep MOVSX to improve superscalar execution. I seriously doubt it exists.



回答2:

As Hans Passant already mentioned your code is a special case of the optimization. I did not look into the compiler/optimizer sources so I can only say what I expect to happen. However, here is my explanation for this.

Your code in C / C++:

while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}

is equivalent to:

while (*s == 32 || *s == 44 || *s == 13 || *s == 12) {
    ++s;
}

or:

while (*s == 12 || *s == 13 || *s == 32 || *s == 44) {
   ++s;
}

The optimizer detects that there is an "if" with multiple (>=3 times) comparisons of the same character. Now, the optimizer generates as many 64bit patterns as needed (up to 4 patterns for an 8 bit char, 64*4 => all 256 possible values). Each of this bit patterns has an offset (= test-value to which the first bit in the pattern corresponds to) that needs to be subtracted from the value in test. In your case only one 64bit-pattern is needed for the values ranging from 12 to 44. So your code gets converted into some generic code like this:

#define ranged_bittest(pattern, value, min_val, max_val) \
  (val >= min_val) && (val <= max_val) && BT_with_offset(pattern, val, min_val)

while ( !ranged_bittest(<BITPATTERN>, *s, 12, 44) ) {
    ++s;
}

Now some other optimization takes ranged_bittest(<BITPATTERN>, *s, 12, 44); as it detects a bittest with start offset 12 and a pattern that can be safely shiftet left by 12 bits (as the upper 12 bits of pattern are zero). ranged_bittest(<BITPATTERN>, *s, 12, 44) => ranged_bittest(<BITPATTERN> << 12, *s, 0, 44)

This evolves to:

while (!(val >= 0 && val <= 44 && BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0))) {
    ++s;
}

The val >= 0 && gets optimized away (as it is assumed to be always true) and the "while" is translated to some "do"-loop with breaks; (which is better for branch prediction in most cases) resulting in:

do {
  if (val > 44) break;
  if (BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;

  ++s;
} while (true);

For any reason the optimization stops here (e.g. because there is a limit on how often further optimizations are applied to a code fragment for performance reasons).

Now in assembly the line

if (BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;`

is translated to something like:

MOV <reg64_A>, const_pattern
MOV <reg64_B>, value
SUB <reg64_B>, const_offset
BT <reg64_A>, <reg64_B>

The assembly optimizer reduces:

MOV <reg64_B>, value
SUB <reg64_B>, 0

to just

MOVSX <reg64_B>, value

in your special case this is:

MOVSX rax, al

The assembly optimizer (somehow) does not have enough information to completely eliminate the sign extension. Maybe the assembly optimizer is quite "dumb" (can't do any logical expression optimizations and stuff, just simple replacements) or the full optimization level isn't activated yet.

It can't therefore remove the movsx rax,al, as it doesn't have any meta information about the possible values of rax or al at this point of code.

I don't KNOW, if this is true, but this my best guess for this case...



回答3:

What struck me most when I first saw this code is how poorly in fact it is optimized.
Yes it's a neat trick to use a 64 bit register for a lookup table but ...

  • Why still use INC as opposed to the better ADD,1 ?
  • Why CMP ... JA ... MOVSX ... when we know that the BT instruction uses its source operand MOD 64 if its destination operand is a register (So there are 58 bits that need not be considered)?
  • Why not benefit from the fact that conditional branches are predicted to go backwards?
  • Why not reduce the number of branches?

I think a true assembler programmer would have written it more like this :

  mov     rcx, 0FFFFEFFEFFFFDBFFh  ;~0000100100002400h
  sub     rsi, 1
  npad    1
$LL82@myFunc:
  add     rsi, 1
  movzx   eax, BYTE PTR [rsi]   ;mov al,[rsi]
  test    al, 11000000b
  setz    bl
  test    bl, bl
  bt      rcx, rax
  ja      SHORT $LL82@myFunc

ja jumps if (CF or ZF) = 0

  • ZF=1 for AL=[64,255] and don't care about CF
  • ZF=0 for AL=[0,63] and CF=0 for AL={10,13,32,44}

For all the ASCII's in the range [64,255] will the test al, 11000000b instruction give a non-zero result (ZF=0). Because the combo setz bl test bl, bl is then used to flip the ZF to 1, there's no longer any chance for the ja instruction to continue the loop.
On the contrary, for all the ASCII's in the range [0,63] will the ZF end up being 0, thus allowing ja to fully interpret the CF gotten from the bt rcx, rax instruction.

Perhaps we are expecting to much from our optimizing compilers?