I'm writing a simple assembler procedure which naturally aims to be as quick as possible. However, a certain part, located in the innermost loop, doesn't seem 'right' and I believe it is possible to come up with a cleverer and quicker implementation, maybe even one without conditional jumps. The code implements a simple thing:
if rax < 0 then
rax := 0
else if rax >= r12 then
rax := r12 - 1
And here's my naive implementation:
cmp rax, 0
jge offsetXGE
mov rax, 0
jmp offsetXReady
offsetXGE:
cmp rax, r12
jl offsetXReady
mov rax, r12
dec rax
offsetXReady:
Any ideas are welcome, even those using MMX and some masking tricks.
EDIT: To answer some questions in comments - yes we can assume that r12 > 0 but rax can be negative.
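For reference, here's the same operation as a C sketch (x and limit stand in for rax and r12):

```c
#include <stdint.h>

/* Clamp x to the range [0, limit-1]; limit is assumed > 0. */
static int64_t clamp(int64_t x, int64_t limit) {
    if (x < 0)
        return 0;
    if (x >= limit)
        return limit - 1;
    return x;
}
```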
One compare-and-branch, and an LEA + cmov.
It's not worth moving scalar data to the vector regs for one or two instructions, and then moving it back. If you could usefully do whole vectors at a time, then you could use PMINSD/PMAXSD
to clamp values to a range like this.
In your original, a couple of things are clearly sub-optimal. The first two only matter for code-size most of the time, but LEA for a non-destructive add is a small but clear win:

cmp eax, 0 should be test eax, eax.

mov rax, 0 should be xor eax, eax. And no, eax is not a typo for rax: writing a 32-bit register implicitly zero-extends into the full 64-bit register, so the shorter 32-bit forms still produce the correct 64-bit result.

mov rax, r12 / dec rax should be lea rax, [r12 - 1].
See the links in the x86 wiki, esp. Agner Fog's guides.
After searching around some, I found a similar question about optimal x86 asm for clamping to a range. I got some inspiration from that, but mostly rewrote it with cmov instead of setcc/dec/and.
I think this is about as good as you can get, for a recent Intel or AMD CPU:
You need a register (or memory location) holding 0, or else an extra instruction to mov reg, 0.
...
cmp rax, r12
jae .clamp ; favour the fast-path more heavily by making it the not-taken case
.clamp_finished: ; r15 is clobbered, since the clamp code uses it as a scratch reg
...
ret
.clamp:
; flags still set from the cmp rax, r12
; we only get here if rax is >= r12 (`ge` signed compare), or negative (`l` rax < r12, signed)
; mov r15d, 0 ; or zero it outside the loop so it can be used when needed. Can't xor-zero because we need to preserve flags
lea rax, [r12-1] ; still doesn't modify flags
cmovl eax, r15d ; rax=0 if orig_rax<r12 (signed), which means we got here because orig_rax<0
jmp .clamp_finished
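The same control flow can be sketched in C (a hedged translation, not the asm itself): one unsigned compare detects both out-of-range cases, and the signed comparison inside plays the role of the cmovl condition:

```c
#include <stdint.h>

static int64_t clamp(int64_t x, int64_t limit) {
    /* Unsigned compare: a negative x reinterprets as a huge unsigned
       number, so this single test catches both x < 0 and x >= limit. */
    if ((uint64_t)x >= (uint64_t)limit) {
        /* Signed compare distinguishes the two cases, like cmovl after
           lea rax, [r12-1]: x < limit (signed) here means x was negative. */
        x = (x < limit) ? 0 : limit - 1;
    }
    return x;
}
```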
quick perf analysis for Intel Haswell:
Fast path: one not-taken compare-and-branch uop. Latency for rax: 0 cycles.
Clamping-needed case: One taken compare-and-branch uop, plus 4 more uops (lea, 2 for cmov, 1 more to jmp back.) Latency for rax: 3 cycles from the later of rax and r12 (cmp->flags, flags->cmov).
Obviously you can use jb instead of jae to skip over the clamping lea/cmov, instead of pulling them out of the main flow. See the section below for motivation for that. (And/or see Anatolyg's excellent answer, which covers this, and from which I also got the cool trick of using jb to handle the whole [0 .. limit] range with one branch.)
I think the version using cmov is the best bet here, even though cmov has a lot of downsides and isn't always faster. Its input operands were already needed, so it's not adding much latency (except in the clamp-to-zero case with branches, see below).
An alternative branchy implementation of the .clamp code that doesn't need a zeroed register would be:
.clamp:
lea rax, [r12-1]
jge .clamp_finished
xor eax, eax
jmp .clamp_finished
It still computes a result it might throw away, cmov-style. However, the following xor starts a fresh dependency chain, so it doesn't have to wait for lea to write rax.
An important question is how often you expect these branches to be taken. If there's a common case (e.g. the no-clamping case), make that the fast-path through the code (as few instructions and as few taken-branches as possible). Depending on how infrequently branches are taken, it can be worth putting the code for the uncommon case off at the end of the function.
func:
...
test
jcc .unlikely
...
.ret_from_unlikely:
...
... ;; lots of code
ret
.unlikely:
xor eax,eax
jmp .ret_from_unlikely ;; this extra jump makes the slow path slower, but that's worth it to make the fast path faster.
Gcc does this, I think, when it decides a branch is unlikely to be taken. So instead of having the typical case take a branch that skips some instructions, the common case falls through. Typically, the default branch prediction is not-taken for forward jumps, so this never even needs a branch-predictor entry until it sees the unlikely case.
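In C you can ask the compiler for this kind of layout with __builtin_expect (a GCC/Clang extension); a minimal sketch, with the unlikely macro name being the usual convention rather than anything standard:

```c
/* GCC/Clang: hint that the clamp is rare, so the compiler can move the
   clamping code out of the fast path, like the hand-laid-out asm above. */
#define unlikely(x) __builtin_expect(!!(x), 0)

long clamp(long x, long limit) {
    if (unlikely((unsigned long)x >= (unsigned long)limit))
        x = (x < limit) ? 0 : limit - 1;
    return x;
}
```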
random thoughts: The code
if (eax < 0) { eax = 0; }
else if (eax >= r12) { eax = r12 - 1; } // if r12 can be zero, the else matters
is equivalent to
eax = min(eax, r12-1);
eax = max(eax, 0);
r12 can't be negative, but the OP didn't say it couldn't be zero; with min first and max last, the result still ends up clamped to 0 in that case. (Edit: actually the OP did say we can assume r12 > 0, not just >= 0.) If we had a fast min/max in asm, we could use it here. Vector max is a single instruction, but scalar takes more code.
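A C sketch of the min-then-max formulation (assuming limit > 0, as the OP stated; helper names are illustrative):

```c
#include <stdint.h>

static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }
static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }

static int64_t clamp(int64_t x, int64_t limit) {
    x = min64(x, limit - 1);  /* upper clamp first */
    x = max64(x, 0);          /* lower clamp last, so 0 wins on conflict */
    return x;
}
```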
A common trick (compilers use it) is to make an unsigned comparison:
cmp rax, r12
jb done
...
...
done:
Here, if rax is negative, when interpreted as an unsigned number (by jb, "jump if below") it looks like a huge number (greater than 2^63), so the unsigned comparison lumps together the two "exceptional" cases (less than 0 and too big).
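A quick C illustration of why the unsigned view works (a sketch; the function name is just for illustration): converting a negative 64-bit value to unsigned yields a value of at least 2^63, so one unsigned compare against limit catches both exceptional cases at once:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when x is outside [0, limit-1]: a negative x reinterprets as a
   value >= 2^63, which is certainly >= limit, so one compare suffices. */
static bool needs_clamp(int64_t x, int64_t limit) {
    return (uint64_t)x >= (uint64_t)limit;
}
```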
If the exceptional case is very rare, the performance of the code denoted by ...
doesn't matter much, and the usual case contains one conditional branch, usually taken. If you want to improve it even more, you can rearrange the code like this:
cmp rax, r12
jae work_needed ; unsigned: taken if rax >= r12 or rax < 0
done:
(your code continued here)
work_needed:
jl upper_limit_done
lea rax, [r12 - 1]
upper_limit_done:
test rax, rax
jns lower_limit_done
xor rax, rax
lower_limit_done:
jmp done
Here, the usual path contains a branch that is usually not taken. This probably provides some minor improvement, at the expense of a slower exceptional case.
Fewer jumps, and cleaner I think:
xor rdx, rdx
test rax, rax
js OK
lea rdx, [r12 - 1]
cmp rax, r12
jge OK
mov rdx, rax
OK:
mov rax, rdx
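The same dataflow in C, using a scratch variable the way this answer uses rdx (a sketch, with x/limit standing in for rax/r12):

```c
#include <stdint.h>

static int64_t clamp(int64_t x, int64_t limit) {
    int64_t result = 0;        /* xor rdx, rdx */
    if (x >= 0) {              /* test rax, rax / js OK */
        result = limit - 1;    /* lea rdx, [r12 - 1] */
        if (x < limit)         /* cmp rax, r12 / jge OK */
            result = x;        /* mov rdx, rax */
    }
    return result;             /* OK: mov rax, rdx */
}
```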