Fastest way to determine if an integer is between

2019-01-02 16:59发布

问题:

Is there a faster way than x >= start && x <= end in C or C++ to test if an integer is between two integers?

UPDATE: My specific platform is iOS. This is part of a box blur function that restricts pixels to a circle in a given square.

UPDATE: After trying the accepted answer, I got an order of magnitude speedup on the one line of code over doing it the normal x >= start && x <= end way.

UPDATE: Here is the after and before code with assembler from XCode:

NEW WAY

// diff = (end - start) + 1
#define POINT_IN_RANGE_AND_INCREMENT(p, range) ((p++ - range.start) < range.diff)

Ltmp1313:
 ldr    r0, [sp, #176] @ 4-byte Reload
 ldr    r1, [sp, #164] @ 4-byte Reload
 ldr    r0, [r0]
 ldr    r1, [r1]
 sub.w  r0, r9, r0
 cmp    r0, r1
 blo    LBB44_30

OLD WAY

#define POINT_IN_RANGE_AND_INCREMENT(p, range) (p <= range.end && p++ >= range.start)

Ltmp1301:
 ldr    r1, [sp, #172] @ 4-byte Reload
 ldr    r1, [r1]
 cmp    r0, r1
 bls    LBB44_32
 mov    r6, r0
 b      LBB44_33
LBB44_32:
 ldr    r1, [sp, #188] @ 4-byte Reload
 adds   r6, r0, #1
Ltmp1302:
 ldr    r1, [r1]
 cmp    r0, r1
 bhs    LBB44_36

Pretty amazing how reducing or eliminating branching can provide such a dramatic speed up.

回答1:

There's an old trick to do this with only one comparison/branch. Whether it'll really improve speed may be open to question, and even if it does, it's probably too little to notice or care about, but when you're only starting with two comparisons, the chances of a huge improvement are pretty remote. The code looks like:

// use a < for an inclusive lower bound and exclusive upper bound
// use <= for an inclusive lower bound and inclusive upper bound
// alternatively, if the upper bound is inclusive and you can pre-calculate
//  upper-lower, simply add + 1 to upper-lower and use the < operator.
    if ((unsigned)(number-lower) <= (upper-lower))
        in_range(number);

With a typical, modern computer (i.e., anything using twos complement), the conversion to unsigned is really a nop -- just a change in how the same bits are viewed.

Note that in a typical case, you can pre-compute upper-lower outside a (presumed) loop, so that doesn't normally contribute any significant time. Along with reducing the number of branch instructions, this also (generally) improves branch prediction. In this case, the same branch is taken whether the number is below the bottom end or above the top end of the range.

As to how this works, the basic idea is pretty simple: a negative number, when viewed as an unsigned number, will be larger than anything that started out as a positive number.

In practice this method translates number and the interval to the point of origin and checks if number is in the interval [0, D], where D = upper - lower. If number below lower bound: negative, and if above upper bound: larger than D.



回答2:

It's rare to be able to do significant optimizations to code on such a small scale. Big performance gains come from observing and modifying the code from a higher level. You may be able to eliminate the need for the range test altogether, or only do O(n) of them instead of O(n^2). You may be able to re-order the tests so that one side of the inequality is always implied. Even if the algorithm is ideal, gains are more likely to come when you see how this code does the range test 10 million times and you find a way to batch them up and use SSE to do many tests in parallel.



回答3:

It depends on how many times you want to perform the test over the same data.

If you are performing the test a single time, there probably isn't a meaningful way to speed up the algorithm.

If you are doing this for a very finite set of values, then you could create a lookup table. Performing the indexing might be more expensive, but if you can fit the entire table in cache, then you can remove all branching from the code, which should speed things up.

For your data the lookup table would be 128^3 = 2,097,152. If you can control one of the three variables so you consider all instances where start = N at one time, then the size of the working set drops down to 128^2 = 16432 bytes, which should fit well in most modern caches.

You would still have to benchmark the actual code to see if a branchless lookup table is sufficiently faster than the obvious comparisons.



回答4:

This answer is to report on a testing done with the accepted answer. I performed a closed range test on a large vector of sorted random integer and to my surprise the basic method of ( low <= num && num <= high) is in fact faster than the accepted answer above! Test was done on HP Pavilion g6 (AMD A6-3400APU with 6GB ram. Here's the core code used for testing:

int num = rand();  // num to compare in consecutive ranges.
chrono::time_point<chrono::system_clock> start, end;
auto start = chrono::system_clock::now();

int inBetween1{ 0 };
for (int i = 1; i < MaxNum; ++i)
{
    if (randVec[i - 1] <= num && num <= randVec[i])
        ++inBetween1;
}
auto end = chrono::system_clock::now();
chrono::duration<double> elapsed_s1 = end - start;

compared with the following which is the accepted answer above:

int inBetween2{ 0 };
for (int i = 1; i < MaxNum; ++i)
{
    if (static_cast<unsigned>(num - randVec[i - 1]) <= (randVec[i] - randVec[i - 1]))
        ++inBetween2;
}

Pay attention that randVec is a sorted vector. For any size of MaxNum the first method beats the second one on my machine!



回答5:

Is it not possible to just perform a bitwise operation on the integer?

Since it has to be between 0 and 128, if the 8th bit is set (2^7) it is 128 or more. The edge case will be a pain, though, since you want an inclusive comparison.



标签: