I'm looking at code generated by GCC-4.8 for x86_64 and wondering if there is a better (faster) way to compute the minimum of three values.
Here's an excerpt from Python's collections module that computes the minimum of m, rightindex + 1, and leftindex:
ssize_t m = n;
if (m > rightindex + 1)
    m = rightindex + 1;
if (m > leftindex)
    m = leftindex;
GCC generates serially dependent code with CMOVs:
leaq 1(%rbp), %rdx
cmpq %rsi, %rdx
cmovg %rsi, %rdx
cmpq %rbx, %rdx
cmovg %rbx, %rdx
Is there faster code that can take advantage of the processor's out-of-order parallel execution by removing the data dependencies? I'm wondering if there are known tricks for computing the minimum of multiple values without using conditionals or predicated instructions. I'm also wondering whether there are saturating-arithmetic intrinsics that would help in this situation.
EDITS:
- As shown the code uses signed arithmetic, but an unsigned arithmetic answer would help as well.
- I asked about a minimum-of-three, but I'm also interested in a minimum-of-n where n is small.
- Linus's admonitions on CMOV: http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus
Here are benchmark results for the methods suggested. The processor is an Intel Core i7-2600K running at 4 GHz. Units are nanoseconds per loop (smaller is better). The measurements include pseudo-random test-data generation and function-call overhead in addition to the actual min3 code under test.
Benchmark source code
functions.s: the 3 solutions evaluated:
min3test.c, main program (mingw64/windows):
build command line:
The minimum of two unsigned numbers has a classical branchless solution:
This approach is probably faster than the cmov solution, but for best speed the instructions have to be interleaved with other, independent instructions so they can execute in parallel.
An implementation of this method for three numbers is possible:
Another option is to test a variant with conditional jumps. On modern processors it might be even faster, especially if the jumps are highly predictable:
The function min(x,y,z) is continuous, but its derivative is not. That derivative has norm 1 everywhere it's defined. There's just no way to express that as an arithmetic function.

Saturating arithmetic has its own discontinuities, so the previous reasoning can't be applied there. However, the saturation point is independent of the input. That in turn implies you'd need to scale the inputs, at which point I'm confident the resulting code won't be faster.
That's of course not a full proof of the non-existence of faster code, but you'd probably need an exhaustive search for that.