Is there a faster way than x >= start && x <= end in C or C++ to test if an integer is between two integers?
UPDATE: My specific platform is iOS. This is part of a box blur function that restricts pixels to a circle in a given square.
UPDATE: After trying the accepted answer, I got an order-of-magnitude speedup on the one line of code over doing it the normal x >= start && x <= end way.
UPDATE: Here is the after-and-before code, with assembler, from Xcode:
NEW WAY
// diff = (end - start) + 1
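// NOTE: relies on unsigned arithmetic -- if p is below range.start, the
// subtraction wraps to a huge value and the < test fails (hence the
// unsigned blo branch in the assembly below)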
#define POINT_IN_RANGE_AND_INCREMENT(p, range) ((p++ - range.start) < range.diff)
Ltmp1313:
ldr r0, [sp, #176] @ 4-byte Reload
ldr r1, [sp, #164] @ 4-byte Reload
ldr r0, [r0]
ldr r1, [r1]
sub.w r0, r9, r0
cmp r0, r1
blo LBB44_30
OLD WAY
#define POINT_IN_RANGE_AND_INCREMENT(p, range) (p <= range.end && p++ >= range.start)
Ltmp1301:
ldr r1, [sp, #172] @ 4-byte Reload
ldr r1, [r1]
cmp r0, r1
bls LBB44_32
mov r6, r0
b LBB44_33
LBB44_32:
ldr r1, [sp, #188] @ 4-byte Reload
adds r6, r0, #1
Ltmp1302:
ldr r1, [r1]
cmp r0, r1
bhs LBB44_36
Pretty amazing how reducing or eliminating branching can provide such a dramatic speed up.
This answer reports on testing done with the accepted answer. I performed a closed-range test on a large vector of sorted random integers, and to my surprise the basic method of (low <= num && num <= high) is in fact faster than the accepted answer above! The test was done on an HP Pavilion g6 (AMD A6-3400 APU with 6 GB of RAM). Here's the core code used for testing:
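The original snippet isn't reproduced above; the following is a sketch of the kind of benchmark loop described. randVec and MaxNum are the names the answer refers to; makeSortedVec and countInRange are illustrative.

#include <algorithm>
#include <cstdlib>
#include <vector>

// Build a sorted vector of random values in [0, MaxNum).
std::vector<int> makeSortedVec(std::size_t n, int MaxNum)
{
    std::vector<int> randVec(n);
    for (std::size_t i = 0; i < n; ++i)
        randVec[i] = std::rand() % MaxNum;
    std::sort(randVec.begin(), randVec.end());
    return randVec;
}

// Count the values inside [low, high] with the basic two-comparison test.
int countInRange(const std::vector<int>& randVec, int low, int high)
{
    int count = 0;
    for (int num : randVec)
        if (low <= num && num <= high)
            ++count;
    return count;
}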
compared with the following, which is the accepted answer above:
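Again a sketch rather than the answer's original code, this time using the accepted answer's single-comparison form:

#include <vector>

// Count via the unsigned-subtraction trick from the accepted answer.
int countInRangeTrick(const std::vector<int>& randVec, int low, int high)
{
    int count = 0;
    for (int num : randVec)
        if ((unsigned)(num - low) <= (unsigned)(high - low))
            ++count;
    return count;
}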
Note that randVec is a sorted vector. For any value of MaxNum, the first method beats the second on my machine!
There's an old trick to do this with only one comparison/branch. Whether it'll really improve speed may be open to question, and even if it does, it's probably too little to notice or care about, but when you're only starting with two comparisons, the chances of a huge improvement are pretty remote. The code looks like:
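The snippet itself is missing here; a minimal reconstruction of the trick as described below, where in_range is a hypothetical placeholder for whatever you do with an in-range value:

// One comparison, one branch: if number < lower, the subtraction wraps
// around to a huge unsigned value, so the <= test fails.
if ((unsigned)(number - lower) <= (unsigned)(upper - lower))
    in_range(number);   // hypothetical handler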
With a typical, modern computer (i.e., anything using two's complement), the conversion to unsigned is really a no-op -- just a change in how the same bits are viewed.
Note that in a typical case, you can pre-compute upper - lower outside a (presumed) loop, so that doesn't normally contribute any significant time. Along with reducing the number of branch instructions, this also (generally) improves branch prediction. In this case, the same branch is taken whether the number is below the bottom end or above the top end of the range. As to how this works, the basic idea is pretty simple: a negative number, when viewed as an unsigned number, will be larger than anything that started out as a positive number.
In practice, this method translates number and the interval to the origin and checks whether number is in the interval [0, D], where D = upper - lower. If number is below the lower bound, the subtraction yields a negative value, which is huge when viewed as unsigned; if it is above the upper bound, the result is larger than D. For example, with lower = 10 and upper = 20 (so D = 10), number = 5 gives (unsigned)(5 - 10) = 4294967291 on a 32-bit machine, far larger than D, so the test fails.

Is it not possible to just perform a bitwise operation on the integer?
Since it has to be between 0 and 128, if the 8th bit is set (2^7) it is 128 or more. The edge case will be a pain, though, since you want an inclusive comparison.
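A sketch of what that bitwise test might look like for the inclusive 0..128 case (the function name is illustrative, not from the answer):

// Values 0..127 have no bits set at or above bit 7, so a single mask
// tests x < 128; the inclusive bound 128 is the awkward edge case.
bool inZeroTo128(unsigned x)
{
    return (x & ~127u) == 0 || x == 128;
}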
It's rare to be able to make significant optimizations to code at such a small scale. Big performance gains come from observing and modifying the code from a higher level. You may be able to eliminate the need for the range test altogether, or do only O(n) of them instead of O(n^2). You may be able to re-order the tests so that one side of the inequality is always implied. Even if the algorithm is ideal, gains are more likely to come when you see how this code does the range test 10 million times and you find a way to batch the tests up and use SSE to do many of them in parallel.
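As a sketch of that batching idea with SSE2 intrinsics (the function name and four-lane layout are illustrative; this combines batching with the unsigned-subtraction trick from the accepted answer):

#include <emmintrin.h>   // SSE2
#include <climits>

// Test four 32-bit ints at once against [lower, upper]. Returns all-ones
// in each lane that is in range. SSE2 only has signed compares, so flip
// the sign bit of both sides to emulate an unsigned comparison.
static __m128i inRange4(__m128i x, int lower, int upper)
{
    const __m128i bias = _mm_set1_epi32(INT_MIN);
    __m128i t = _mm_xor_si128(_mm_sub_epi32(x, _mm_set1_epi32(lower)), bias);
    __m128i d = _mm_xor_si128(_mm_set1_epi32(upper - lower), bias);
    __m128i gt = _mm_cmpgt_epi32(t, d);           // lanes where t > d: out of range
    return _mm_xor_si128(gt, _mm_set1_epi32(-1)); // invert to get in-range lanes
}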
It depends on how many times you want to perform the test over the same data.
If you are performing the test a single time, there probably isn't a meaningful way to speed up the algorithm.
If you are doing this for a very finite set of values, then you could create a lookup table. Performing the indexing might be more expensive, but if you can fit the entire table in cache, then you can remove all branching from the code, which should speed things up.
For your data, the lookup table would be 128^3 = 2,097,152 bytes. If you can control one of the three variables so that you consider all instances where start = N at one time, then the size of the working set drops to 128^2 = 16,384 bytes, which should fit well in most modern caches. You would still have to benchmark the actual code to see if a branchless lookup table is sufficiently faster than the obvious comparisons.
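A sketch of the start = N idea (table layout and names are illustrative): with start fixed, one byte per (end, x) pair turns the test into a single branchless load.

// 129 x 129 entries covers values 0..128; about 16 KB, cache-friendly.
static unsigned char inRangeTable[129][129];

void buildTableForStart(int N)
{
    for (int end = 0; end <= 128; ++end)
        for (int x = 0; x <= 128; ++x)
            inRangeTable[end][x] = (N <= x && x <= end);
}

// usage: if (inRangeTable[end][x]) { ... }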