x86 assembly - optimization of clamping rax to [0, r12)

Posted 2019-02-27 18:17

Question:

I'm writing a simple assembly procedure which, naturally, aims to be as quick as possible. However, a certain part, located in the innermost loop, doesn't seem 'right', and I believe it is possible to come up with a cleverer and quicker implementation, maybe even one without conditional jumps. The code implements a simple thing:

if rax < 0 then rax := 0 else if rax >= r12 then rax := r12 - 1

And here's my naive implementation:

cmp rax, 0
jge offsetXGE
   mov rax, 0
   jmp offsetXReady
offsetXGE:
   cmp rax, r12
   jl offsetXReady
   mov rax, r12
   dec rax
offsetXReady:

Any ideas are welcome, even those using MMX and some masking tricks.

EDIT: To answer some questions in the comments - yes, we can assume that r12 > 0, but rax can be negative.

Answer 1:

One compare-and-branch, and an LEA + cmov.


It's not worth moving scalar data to the vector regs for one or two instructions, and then moving it back. If you could usefully do whole vectors at a time, then you could use PMINSD/PMAXSD to clamp values to a range like this.
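
(For illustration, a minimal sketch of that vector approach, assuming SSE4.1, packed signed dwords in xmm0, and that xmm1 = broadcast(limit-1) and xmm2 = all-zeros were set up outside the loop:)

pminsd  xmm0, xmm1   ; per-element: x = min(x, limit-1), signed
pmaxsd  xmm0, xmm2   ; per-element: x = max(x, 0)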

In your original, a couple things are clearly sub-optimal. The first two only matter for code-size most of the time, but LEA for a non-destructive add is a small but clear win:

  • cmp rax, 0 should be test rax, rax

  • mov rax, 0 should be xor eax, eax. And no, eax is not a typo for rax.

  • mov rax, r12 / dec rax should be lea rax, [r12 - 1].

See the links in the x86 wiki, esp. Agner Fog's guides.
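
Applied to the question's code, those three fixes alone (same branchy structure, just cheaper instructions) would look something like this:

test rax, rax
jge offsetXGE
   xor eax, eax          ; writing eax zero-extends into rax
   jmp offsetXReady
offsetXGE:
   cmp rax, r12
   jl offsetXReady
   lea rax, [r12 - 1]    ; non-destructive, replaces mov + dec
offsetXReady: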


After searching around some, I found a similar question about optimal x86 asm for clamping to a range. I got some inspiration from that, but mostly rewrote it with cmov instead of setcc/dec/and.
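
(For reference, a sketch of the setcc/dec/and idiom for the clamp-to-zero half; this is the general pattern, not necessarily the exact code from that question:)

xor   edx, edx
test  rax, rax
sets  dl             ; dl = 1 if rax is negative
dec   rdx            ; rdx = 0 if rax was negative, else all-ones
and   rax, rdx       ; rax = max(rax, 0)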

I think this is about as good as you can get, for a recent Intel or AMD CPU:

You need a register (or memory location) holding 0, or else an extra instruction to mov reg, 0.

    ...
    cmp  rax, r12
    jae  .clamp      ; favour the fast-path more heavily by making it the not-taken case
.clamp_finished:     ; note: the clamp code uses r15 as a zeroed scratch reg

    ...
    ret

.clamp:   
    ; flags still set from the cmp rax, r12
    ; we only get here if rax is >= r12 (`ge` signed compare), or negative (`l` rax < r12, signed)

    ; mov r15d, 0        ; or zero it outside the loop so it can be used when needed.  Can't xor-zero because we need to preserve flags

    lea    rax, [r12-1]  ; still doesn't modify flags
    cmovl  eax, r15d     ; rax=0 if  orig_rax<r12 (signed), which means we got here because orig_rax<0
    jmp  .clamp_finished

Quick perf analysis for Intel Haswell:

  • Fast path: one not-taken compare-and-branch uop. Latency for rax: 0 cycles.

  • Clamping-needed case: One taken compare-and-branch uop, plus 4 more uops (lea, 2 for cmov, 1 more to jmp back.) Latency for rax: 3 cycles from the later of rax and r12 (cmp->flags, flags->cmov).

Obviously you can use jb instead of jae to branch over an inline clamping lea/cmov, instead of pulling them out of the main flow. See the section below for motivation. (And/or see Anatolyg's excellent answer, which covers this. I got the cool trick of using jb to check the whole [0 .. limit) range with one branch from Anatolyg's answer, too.)

I think the version using cmov is the best bet here, even though cmov has a lot of downsides and isn't always faster. Its input operands were already needed, so it's not adding much latency (except in the clamp-to-zero case, where a branchy version avoids the dependency; see below).

An alternative branchy implementation of the .clamp code that doesn't need a zeroed-register would be:

.clamp:
    lea    rax, [r12-1]   ; rax = r12-1; lea doesn't modify flags
    jge  .clamp_finished  ; flags still set from cmp rax, r12: signed rax >= r12
    xor    eax, eax       ; clamp-to-zero case
    jmp  .clamp_finished

It still computes a result it might throw away, cmov-style. However, the xor-zeroing starts a fresh dependency chain, so the clamp-to-zero result doesn't have to wait for the lea to write rax.


An important question is how often you expect these branches to be taken. If there's a common case (e.g. the no-clamping case), make that the fast path through the code (as few instructions and as few taken branches as possible). If a branch is taken rarely enough, it can be worth moving the code for the uncommon case off to the end of the function.

func:
    ...
    test
    jcc .unlikely
    ...        
.ret_from_unlikely:
    ...
    ... ;; lots of code
    ret

.unlikely:
    xor   eax,eax
    jmp .ret_from_unlikely   ;; this extra jump makes the slow path slower, but that's worth it to make the fast path faster.

GCC does this, I think when it decides a branch is unlikely to be taken. So instead of having the typical case take a branch that skips some instructions, the common case falls through. Typically, the default branch prediction is not-taken for forward jumps, so this never even needs a branch-predictor entry until it sees the unlikely case.


random thoughts: The code

if (rax < 0) { rax = 0; }
else if (rax >= r12) { rax = r12 - 1; }  // If r12 can be zero, the else matters

is equivalent to

rax = min(rax, r12-1);
rax = max(rax, 0);

r12 can't be negative, but the OP didn't say it couldn't be zero; this ordering (min first, then max) preserves the if/else semantics. (edit: actually the OP did say you can assume r12 > 0, not just >= 0.) If we had a fast min/max in asm, we could use it here. Vector max is a single instruction, but a scalar min/max takes more code, as shown below.
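
For example, a branchless scalar version of that min/max sequence (a sketch; it assumes rdx is free as a scratch register) could be:

lea    rdx, [r12 - 1]
cmp    rax, rdx
cmovg  rax, rdx      ; rax = min(rax, r12-1), signed
xor    edx, edx      ; rdx = 0 (clobbers flags, but we're done with them)
test   rax, rax
cmovs  rax, rdx      ; rax = max(rax, 0)

Unlike the branchy versions above, this executes the same instructions whether or not clamping is needed, which is only a win if the clamp condition is unpredictable.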



Answer 2:

A common trick (compilers use it) is to make an unsigned comparison:

    cmp rax, r12
    jb done          ; unsigned below: 0 <= rax < r12, no clamping needed
    ...              ; clamping code goes here
    ...
done:

Here, if rax is negative, then when interpreted as an unsigned number (by jb, "jump if below") it looks like a huge number (greater than 2^63), so the unsigned comparison lumps together the two "exceptional" cases (less than 0 and too big). For example, rax = -1 has the bit pattern 0xFFFFFFFFFFFFFFFF, which as an unsigned number is 2^64 - 1, above any possible r12.

If the exceptional case is very rare, the performance of the code denoted by ... doesn't matter much, and the usual case contains one conditional branch, usually taken. If you want to improve it even more, you can rearrange the code like this:

    cmp rax, r12
    jae work_needed       ; unsigned: taken only if rax < 0 or rax >= r12
done:
    (your code continued here)

work_needed:
    jl upper_limit_done   ; signed rax < r12 here means rax was negative
    lea rax, [r12 - 1]
upper_limit_done:
    test rax, rax
    jns lower_limit_done
    xor rax, rax
lower_limit_done:
    jmp done

Here, the usual path contains a branch that is usually not taken. This probably provides some minor improvement, at the expense of a slower exceptional case.



Answer 3:

Fewer jumps, and cleaner, I think:

xor  rdx, rdx        ; rdx = 0, the result for the rax < 0 case
test rax, rax
js   OK              ; rax < 0: keep rdx = 0
lea  rdx, [r12 - 1]  ; rdx = r12 - 1, the result for the rax >= r12 case
cmp  rax, r12
jge  OK              ; rax >= r12 (signed): keep rdx = r12 - 1
mov  rdx, rax        ; in range: pass rax through unchanged
OK:
mov  rax, rdx