Summary:
I'm looking for the fastest way to calculate `(int) x / (int) y` without getting an exception for `y==0`. Instead I just want an arbitrary result.
Background:
When coding image processing algorithms I often need to divide by an (accumulated) alpha value. The simplest variant is plain C code with integer arithmetic. My problem is that I typically get a division-by-zero error for result pixels with `alpha==0`. However, these are exactly the pixels where the result doesn't matter at all: I don't care about the color values of pixels with `alpha==0`.
Details:
I'm looking for something like:
result = (y==0)? 0 : x/y;
or
result = x / MAX( y, 1 );
x and y are positive integers. The code is executed a huge number of times in a nested loop, so I'm looking for a way to get rid of the conditional branching.
When y does not exceed the byte range, I'm happy with the solution
unsigned char kill_zero_table[256] = { 1, 1, 2, 3, 4, 5, 6, 7, [...] 255 };
[...]
result = x / kill_zero_table[y];
But this obviously does not work well for bigger ranges.
I guess the final question is: What's the fastest bit-twiddling hack that changes 0 to any other integer value while leaving all other values unchanged?
Clarifications
I'm not 100% sure that branching is too expensive. However, different compilers are used, so I prefer benchmarking with few optimizations enabled (which is indeed questionable).
For sure, compilers are great when it comes to bit twiddling, but I can't express the "don't care" result in C, so the compiler will never be able to use the full range of optimizations.
Code should be fully C compatible; the main platforms are Linux 64-bit with gcc & clang, and macOS.
Inspired by some of the comments, I got rid of the branch on my Pentium with the gcc compiler using an arithmetic idiom. The compiler basically recognizes that it can use a condition flag of the test in the addition.
As per request, the assembly:
As this turned out to be such a popular question and answer, I'll elaborate a bit more. The above example is based on a programming idiom that a compiler recognizes: a boolean expression is used in integer arithmetic, and the condition flags that hardware provides for exactly this purpose are employed. In general, condition flags are only accessible from C through such idioms. That is why it is so hard to write a portable multiple-precision integer library in C without resorting to (inline) assembly. My guess is that most decent compilers will understand the above idiom.
Another way of avoiding branches, as also remarked in some of the above comments, is predicated execution. I therefore took Philipp's first code sample and my own and ran them through the compiler from ARM and the GCC compiler for the ARM architecture, which features predicated execution. Both compilers avoid the branch in both code samples:
Philipp's version with the ARM compiler:
Philipp's version with GCC:
My code with the ARM compiler:
My code with GCC:
All versions still need a branch to the division routine, because this version of the ARM has no hardware divider, but the test for `y == 0` is fully implemented through predicated execution.

According to this link, you can just block the SIGFPE signal with `sigaction()` (I have not tried it myself, but I believe it should work). This is the fastest possible approach if divide-by-zero errors are extremely rare: you only pay for the divisions by zero, not for the valid divisions, and the normal execution path is not changed at all.
However, the OS will be involved in every exception that's ignored, which is expensive. I think you should have at least a thousand good divisions per ignored division by zero. If exceptions are more frequent than that, you'll likely pay more by ignoring them than by checking every value before the division.
Here are some concrete numbers, on Windows using GCC 4.7.2:
Note that I am intentionally not calling `srand()`, so that `rand()` always returns exactly the same results. Note also that `-DCHECK=0` merely counts the zeroes, so that it is obvious how often zero appeared. Now, compiling and timing it in various ways:
shows output that can be summarised in a table:
If zeroes are rare, the `-DCHECK=2` version performs badly. As zeroes start appearing more often, the `-DCHECK=2` case starts performing significantly better. Out of the other options, there really isn't much difference.

For `-O3`, though, it is a different story: there, check 2 has no drawback compared to the other checks, and it keeps its benefits as zeroes become more common.
You should really measure to see what happens with your compiler and your representative sample data, though.
Without knowing the platform there is no way to know the exact most efficient method; however, on a generic system this may be close to the optimum (using Intel assembler syntax):

(assume the divisor is in `ecx` and the dividend is in `eax`)

Four unbranched, single-cycle instructions plus the divide. The quotient will be in `eax` and the remainder in `edx` at the end. (This kind of shows why you don't want to send a compiler to do a man's job.)