How fast can you make linear search?

Published 2019-01-31 12:58

Question:

I'm looking to optimize this linear search:

static int
linear (const int *arr, int n, int key)
{
        int i = 0;
        while (i < n) {
                if (arr [i] >= key)
                        break;
                ++i;
        }
        return i;
}

The array is sorted and the function is supposed to return the index of the first element that is greater than or equal to the key. The array is not large (below 200 elements) and will be prepared once for a large number of searches. Array elements after the n-th can, if necessary, be initialized to something appropriate, if that speeds up the search.

No, binary search is not allowed, only linear search.

Edit: All my knowledge about this topic is now summarized in this blog post.

Answer 1:

  1. Tell your boss you can make it 50% faster, but it will take 6 months, and some money.
  2. Wait six months.
  3. Buy new hardware.

Well, it makes about as much sense as a linear search through a sorted array!

(More seriously, can you give us some clues about why no binary search?)



Answer 2:

So far you have received multiple pieces of advice, most of which state that linear search makes no sense on sorted data and that binary search will work much more efficiently instead. This often happens to be one of those popular "sounds right" assertions made by people who don't care to give the problem too much thought. In reality, if you consider the bigger picture, given the right circumstances, linear search can be much more efficient than binary search.

Note that if we consider a single search query for a sorted array, binary search is a significantly more efficient method than linear search. There's no argument about that. Also, when you perform multiple completely random queries against the same data, binary search still wins over linear search.

However, the picture begins to change if we consider sequential search queries and these queries are not exactly random. Imagine that queries arrive in sorted order, i.e. each next query is for a higher value than the previous query; in other words, the queries themselves are sorted. BTW, they don't have to be globally and strictly sorted: from time to time the query sequence might get "reset", i.e. a low value is queried, but on average the subsequent queries should arrive in increasing order. In other words, the queries arrive in series, each series sorted in ascending order. In this case, if the average length of a series is comparable to the length of your array, linear search will outperform binary search by a huge margin. However, to take advantage of this situation, you have to implement your search in an incremental manner. It is simple: if the next query is greater than the previous one, you don't need to start the search from the beginning of the array. Instead, you can search from the point where the previous search stopped. The most simplistic implementation (just to illustrate the idea) might look as follows:

static int linear(const int *arr, int n, int key)
{
  static int previous_key = INT_MIN;
  static int previous_i = 0;

  int i = key >= previous_key ? previous_i : 0;

  while (i < n) {
    if (arr[i] >= key)
      break;
    ++i;
  }

  previous_key = key;
  previous_i = i;

  return i;
}

(Disclaimer: the above implementation is terribly ugly for the obvious reason that the array arrives from outside as a parameter, while the previous search state is stored internally. Of course, this is the wrong way to do it in practice. But again, the above is intended to illustrate the idea and nothing more.)
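For illustration, one way to fix that is to let the caller own the state (a sketch; the context struct and all names here are not from the original answer):

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical context type: keeps the cursor between queries
   so the caller owns the state instead of the function. */
typedef struct {
    int previous_key;
    int previous_i;
} search_ctx;

static void ctx_init(search_ctx *c)
{
    c->previous_key = INT_MIN;
    c->previous_i = 0;
}

/* Incremental linear search: resumes from the previous stop
   when the queries arrive in ascending order. */
static int linear_ctx(search_ctx *c, const int *arr, int n, int key)
{
    int i = key >= c->previous_key ? c->previous_i : 0;
    while (i < n && arr[i] < key)
        ++i;
    c->previous_key = key;
    c->previous_i = i;
    return i;
}
```

This keeps the function reentrant and lets separate query streams use separate cursors.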

Note that the complexity of processing each series of ordered queries using the above approach is always O(N), regardless of the length of the series. Using binary search, the complexity would be O(M * log N). So, for obvious reasons, when M is close to N, i.e. queries arrive in sufficiently long ordered series, the above linear search will significantly outperform binary search, while for small M binary search will win.

Also, even if the ordered series of queries are not very long, the above modification might still give you a noticeable improvement in search performance, considering that you have to use linear search.

P.S. As an additional piece of information about the structure of the problem:

When you need to perform searches in an ordered array of length N and you know in advance that the queries will arrive in ordered series of [approximate, average] length M, the optimal algorithm will look as follows:

  1. Calculate the stride value S = [N/M]. It might also make sense to "snap" the value of S to the [nearest] power of 2. Think of your sorted array as a sequence of blocks of length S - so called S-blocks.
  2. After receiving a query, perform incremental linear search for the S-block that potentially contains the queried value, i.e. it is an ordinary linear search with stride S (of course, remember to start from the block where the previous search left off).
  3. After finding the S-block, perform the binary search within the S-block for the queried value.

The above is the optimal incremental search algorithm possible, in the sense that it achieves the theoretical limit on the asymptotic efficiency of repetitive search. Note that if the value of M is much smaller than N, the algorithm "automatically" shifts itself towards binary search, while when M gets close to N the algorithm "automatically" favors linear search. The latter makes sense because in such an environment linear search is significantly more efficient than binary search.
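The three steps above can be sketched as follows (a sketch only; `stride_search` and the cursor handling are illustrative, not from the original answer):

```c
#include <assert.h>

/* Stride-S incremental search: a linear scan with stride S locates the
   S-block that can contain the answer, then a binary search runs inside
   that block.  The cursor *pos persists between calls; if the new key is
   lower than what the cursor has already passed, the scan restarts at 0,
   so the function stays correct for out-of-order queries too. */
static int stride_search(const int *arr, int n, int stride, int key, int *pos)
{
    int i = *pos;
    if (i > 0 && arr[i - 1] >= key)           /* lower key: restart */
        i = 0;
    while (i + stride <= n && arr[i + stride - 1] < key)
        i += stride;                          /* skip whole S-blocks */
    int lo = i;
    int hi = (i + stride < n) ? i + stride : n;
    while (lo < hi) {                         /* binary search in block */
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    *pos = i;                                 /* resume here next time */
    return lo;                 /* first index with arr[idx] >= key */
}
```

With stride = N/M, ascending series cost O(N + M log(N/M)) per series, which degrades gracefully toward plain binary search as M shrinks.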

This all is just to illustrate the fact that blanket statements like "linear search on a sorted array is always useless" indicate nothing else than lack of knowledge on the part of those who make such statements.



Answer 3:

Since you can put known values after the last valid entry, add a sentinel element arr[n] = INT_MAX (or any value that is >= every possible key) to make sure the loop cannot run past the end of the array, without having to test for i < n.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                ++i;
        }
        return i;
}
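For completeness, here is a caller-side sketch of preparing the sentinel slot (the extra array slot and the INT_MAX padding are assumptions consistent with the question's setup, which allows initializing elements after the n-th):

```c
#include <assert.h>
#include <limits.h>

/* Sentinel scan: relies on the caller keeping a value >= any
   possible key in the slot just past the last valid element. */
static int linear_sentinel(const int *arr, int key)
{
    int i = 0;
    while (arr[i] < key)
        ++i;
    return i;
}
```

Since the array is prepared once for many searches, the one-time cost of writing the sentinel is negligible.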

You could also try unrolling the loop, with the same sentinel value:

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (true) {
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
        }
        return --i;
}


Answer 4:

First of all, any fast solution must use vectorization to compare many elements at once.

However, all the vectorized implementations posted so far suffer from a common problem: they have branches. As a result, they have to introduce blockwise processing of the array (to reduce overhead of branching), which leads to low performance for small arrays. For large arrays linear search is worse than a well-optimized binary search, so there is no point in optimizing it.

However, linear search can be implemented without branches at all. The idea is very simple: the index you want is precisely the number of elements in the array that are less than the key you search for. So you can compare each element of the array to the key value and sum all the flags:

static int linear_stgatilov_scalar (const int *arr, int n, int key) {
    int cnt = 0;
    for (int i = 0; i < n; i++)
        cnt += (arr[i] < key);
    return cnt;
}

A fun thing about this solution is that it would return the same answer even if you shuffle the array =) Although this solution seems rather slow, it can be vectorized elegantly. The implementation provided below requires the array to be 16-byte aligned. Also, the array must be padded with INT_MAX elements because it consumes 16 elements at once.

static int linear_stgatilov_vec (const int *arr, int n, int key) {
    assert((size_t)arr % 16 == 0);
    __m128i vkey = _mm_set1_epi32(key);
    __m128i cnt = _mm_setzero_si128();
    for (int i = 0; i < n; i += 16) {
        __m128i mask0 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+0]), vkey);
        __m128i mask1 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+4]), vkey);
        __m128i mask2 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+8]), vkey);
        __m128i mask3 = _mm_cmplt_epi32(_mm_load_si128((__m128i *)&arr[i+12]), vkey);
        __m128i sum = _mm_add_epi32(_mm_add_epi32(mask0, mask1), _mm_add_epi32(mask2, mask3));
        cnt = _mm_sub_epi32(cnt, sum);
    }
    cnt = _mm_hadd_epi32(cnt, cnt);
    cnt = _mm_hadd_epi32(cnt, cnt);
//  int ans = _mm_extract_epi32(cnt, 0);    //SSE4.1
    int ans = _mm_extract_epi16(cnt, 0);    //correct only for n < 32K
    return ans;
}

The final reduction of a single SSE register can be implemented with SSE2 only if necessary; it should not really affect the overall performance.
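For reference, that SSE2-only reduction could look like this (a sketch, not from the original answer; `hsum_epi32_sse2` is an illustrative name), replacing the two SSSE3 `_mm_hadd_epi32` calls:

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 only */

/* Horizontal sum of the four 32-bit lanes of an SSE2 register:
   swap the 64-bit halves and add, then swap adjacent 32-bit
   lanes and add, leaving the total in every lane. */
static int hsum_epi32_sse2(__m128i v)
{
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(v);
}
```

This also avoids the `_mm_extract_epi16` trick and its n < 32K restriction.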

I have tested it with Visual C++ 2013 x64 compiler on Intel Core2 Duo E4700 (quite old, yeah). The array of size 197 is generated with elements provided by rand(). The full code with all the testing is here. Here is the time to perform 32M searches:

[OP]
Time = 3.155 (-896368640) //the original OP's code
[Paul R]
Time = 2.933 (-896368640)
[stgatilov]
Time = 1.139 (-896368640) //the code suggested

The OP's original code processes 10.6 million arrays per second (2.1 billion elements per second). The suggested code processes 29.5 million arrays per second (5.8 billion elements per second). Also, the suggested code works well for smaller arrays: even for arrays of 15 elements, it is still almost three times faster than the OP's original code.

Here is the generated assembly:

$LL56@main:
    movdqa  xmm2, xmm4
    movdqa  xmm0, xmm4
    movdqa  xmm1, xmm4
    lea rcx, QWORD PTR [rcx+64]
    pcmpgtd xmm0, XMMWORD PTR [rcx-80]
    pcmpgtd xmm2, XMMWORD PTR [rcx-96]
    pcmpgtd xmm1, XMMWORD PTR [rcx-48]
    paddd   xmm2, xmm0
    movdqa  xmm0, xmm4
    pcmpgtd xmm0, XMMWORD PTR [rcx-64]
    paddd   xmm1, xmm0
    paddd   xmm2, xmm1
    psubd   xmm3, xmm2
    dec r8
    jne SHORT $LL56@main
$LN54@main:
    phaddd  xmm3, xmm3
    inc rdx
    phaddd  xmm3, xmm3
    pextrw  eax, xmm3, 0

Finally, I'd like to note that a well-optimized binary search can be made faster by switching to the described vectorized linear search as soon as the interval becomes small.
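That hybrid could be sketched like this (scalar version for clarity; the cutoff of 16 and the name `hybrid_lower_bound` are assumptions to be tuned by measurement, and the linear part would be the vectorized loop above in practice):

```c
#include <assert.h>

/* Binary search that hands over to the branchless counting loop
   once the remaining interval drops below a cutoff. */
static int hybrid_lower_bound(const int *arr, int n, int key)
{
    int lo = 0, hi = n;
    while (hi - lo > 16) {              /* narrow with binary steps */
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* Branchless finish: count elements < key in [lo, hi). */
    int cnt = 0;
    for (int i = lo; i < hi; i++)
        cnt += (arr[i] < key);
    return lo + cnt;
}
```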

UPDATE: More information can be found in my blog post on the matter.



Answer 5:

If a target-specific solution is acceptable then you can quite easily use SIMD (SSE, AltiVec, or whatever you have available) to get ~ 4x speed-up by testing 4 elements at a time rather than just 1.

Out of interest I put together a simple SIMD implementation as follows:

int linear_search_ref(const int32_t *A, int32_t key, int n)
{
    int result = -1;
    int i;

    for (i = 0; i < n; ++i)
    {
        if (A[i] >= key)
        {
            result = i;
            break;
        }
    }
    return result;
}

int linear_search(const int32_t *A, int32_t key, int n)
{
#define VEC_INT_ELEMS 4
#define BLOCK_SIZE (VEC_INT_ELEMS * 32)
    const __m128i vkey = _mm_set1_epi32(key);
    int vresult = -1;
    int result = -1;
    int i, j;

    for (i = 0; i <= n - BLOCK_SIZE; i += BLOCK_SIZE)
    {
        __m128i vmask0 = _mm_set1_epi32(-1);
        __m128i vmask1 = _mm_set1_epi32(-1);
        int mask0, mask1;

        for (j = 0; j < BLOCK_SIZE; j += VEC_INT_ELEMS * 2)
        {
            __m128i vA0 = _mm_load_si128((const __m128i *)&A[i + j]);
            __m128i vA1 = _mm_load_si128((const __m128i *)&A[i + j + VEC_INT_ELEMS]);
            __m128i vcmp0 = _mm_cmpgt_epi32(vkey, vA0);
            __m128i vcmp1 = _mm_cmpgt_epi32(vkey, vA1);
            vmask0 = _mm_and_si128(vmask0, vcmp0);
            vmask1 = _mm_and_si128(vmask1, vcmp1);
        }
        mask0 = _mm_movemask_epi8(vmask0);
        mask1 = _mm_movemask_epi8(vmask1);
        if ((mask0 & mask1) != 0xffff)
        {
            vresult = i;
            break;
        }
    }
    if (vresult > -1)
    {
        result = vresult + linear_search_ref(&A[vresult], key, BLOCK_SIZE);
    }
    else if (i < n)
    {
        result = i + linear_search_ref(&A[i], key, n - i);
    }
    return result;
#undef BLOCK_SIZE
#undef VEC_INT_ELEMS
}

On a 2.67 GHz Core i7, using OpenSUSE x86-64 and gcc 4.3.2, I get around 7x - 8x improvement around a fairly broad "sweet spot" where n = 100000 with the key being found at the midpoint of the array (i.e. result = n / 2). Performance drops off to around 3.5x when n gets large and the array therefore exceeds cache size (presumably becoming memory bandwidth-limited in this case). Performance also drops off when n is small, due to inefficiency of the SIMD implementation (it was optimised for large n of course).



Answer 6:

You've received many suggestions for improvements, but you need to measure each optimization to see which is best given your hardware and compiler.

As an example of this, in the first version of this response I guessed that by 100-200 array elements, the slightly higher overhead of binary search would easily be paid for by far fewer probes into the array. However, in the comments below, Mark Probst reports that he sees linear search winning up to about 500 entries on his hardware. This reinforces the need to measure when searching for the very best performance.

Note: Edited following Mark's comments below on his measurements of linear versus binary search for reasonably small N.



Answer 7:

You can do it in parallel.

If the list is small, maybe it won't be worth splitting the search, but if you have to process lots of searches, then you can definitely run them in parallel. That wouldn't reduce the latency of the individual operations, but it would improve the throughput.
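One way to sketch this is to split the query batch among POSIX threads (all names here are illustrative; the thread count and the plain linear search are assumptions):

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

struct job {
    const int *arr; int n;
    const int *keys; int *out;
    int begin, end;            /* range of queries this thread owns */
};

static int lower_bound_linear(const int *arr, int n, int key)
{
    int i = 0;
    while (i < n && arr[i] < key) ++i;
    return i;
}

static void *worker(void *p)
{
    struct job *j = p;
    for (int q = j->begin; q < j->end; q++)
        j->out[q] = lower_bound_linear(j->arr, j->n, j->keys[q]);
    return 0;
}

/* Answer nq independent queries using NTHREADS threads. */
static void search_batch(const int *arr, int n,
                         const int *keys, int *out, int nq)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        jobs[t] = (struct job){ arr, n, keys, out,
                                nq * t / NTHREADS, nq * (t + 1) / NTHREADS };
        pthread_create(&tid[t], 0, worker, &jobs[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], 0);
}
```

Since the array is read-only during searching, no synchronization beyond the final joins is needed.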



Answer 8:

If you're on an Intel platform:

int linear (const int *array, int n, int key)
{
  __asm
  {
    mov edi,array
    mov ecx,n
    mov eax,key
    repne scasd
    mov eax,-1
    jne end
    mov eax,n
    sub eax,ecx
    dec eax
end:
  }
}

but that only finds exact matches, not greater than or equal matches.

In C, you can also use Duff's Device:

int linear (const int *array, int n, int key)
{
  const int
    *end = &array [n];

  int
    result = 0;

  switch (n % 8)
  {
    do {
  case 0:
    if (*(array++) >= key) break;
    ++result;
  case 7:
    if (*(array++) >= key) break;
    ++result;
  case 6:
    if (*(array++) >= key) break;
    ++result;
  case 5:
    if (*(array++) >= key) break;
    ++result;
  case 4:
    if (*(array++) >= key) break;
    ++result;
  case 3:
    if (*(array++) >= key) break;
    ++result;
  case 2:
    if (*(array++) >= key) break;
    ++result;
  case 1:
    if (*(array++) >= key) break;
    ++result;
    } while(array < end);
  }

  return result;
}


Answer 9:

If you had a quantum computer, you could use Grover's algorithm to search your data in O(√N) time using O(log N) storage space. Otherwise, your question is pretty silly. Binary search or one of its variants (ternary search, for example) is really your best choice. Doing micro-optimizations on a linear search is stupid when you can pick a superior algorithm.



Answer 10:

Unroll with fixed array indices:

int linear( const int *array, int n, int key ) {
  int i = 0;
  if ( array[n-1] >= key ) {
     do {
       if ( array[0] >= key ) return i+0;
       if ( array[1] >= key ) return i+1;
       if ( array[2] >= key ) return i+2;
       if ( array[3] >= key ) return i+3;
       array += 4;
       i += 4;
     } while ( true );
  }
  return -1;
}


Answer 11:

You could avoid the n bounds checks, similar to what loop unrolling does:

static int linear(int *array, int arraySize, int key)
{
  //assuming the array has one extra slot at index arraySize
  array[arraySize] = key; //sentinel guarantees the loop terminates

  int i = 0;
  for (; ; ++i)
  {
     if (array[i] >= key) return i;
  }
}


Answer 12:

I know that this topic is old, but I could not keep myself from posting. My optimization for a sentinel linear search is:

int sentinel_linear_search(int key, int *arr, int n)
{
    int last_value, i;

    /* considering that n is the real size of the array */
    if (--n < 0)
        return -1;

    last_value = arr[n];

    /* set array last member as the key */
    arr[n] = key;

    i = 0;
    while (arr[i] != key)
        ++i;

    /* recover the real array last member */
    arr[n] = last_value;

    return (arr[i] == key) ? i : -1;
}

The great improvement of the sentinel search is that each iteration uses only one conditional branch (on the key) instead of two (on the index and on the key):

    while (arr[i] != key)
        ++i;


Answer 13:

Loop backwards. This might be translated...

// loop backward

for (int i = arraySize - 1; i >=0; --i)

...to this (which "could be" faster):

    mov cx, arraySize - 1
detectionHere:
    ...
    loop detectionHere   

Other than that, only binary search can make searching faster.



Answer 14:

This might encourage the compiler to emit vector instructions (suggested by Gman):

for (int i = 0; i < N; i += 4) {
    bool found = false;   
    found |= (array[i+0] >= key);
    ...
    found |= ( array[i+3] >= key);
    // slight variation would be to use max intrinsic
    if (found) return i;
}
...
// quick search among four elements

This also results in fewer branch instructions. You can help by ensuring the input array is aligned to a 16-byte boundary.

Another thing that may help vectorization (doing a vertical max comparison):

for (int i = 0; i < N; i += 8) {
    bool found = false;   
    found |= max(array[i+0], array[i+4]) >= key;
    ...
    found |= max(array[i+3], array[i+7]) >= key;
    if (found) return i;
}
// have to search eight elements
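The pairwise-max idea can be made concrete in scalar C (a sketch; `linear_max8` and the block handling are illustrative, and n is assumed to be a multiple of 8):

```c
#include <assert.h>

static int imax(int a, int b) { return a > b ? a : b; }

/* Test eight elements per iteration via pairwise max; once a block
   tests positive, a short scan inside it finds the exact index. */
static int linear_max8(const int *arr, int n, int key)
{
    int i = 0;
    for (; i < n; i += 8) {
        int found = 0;
        found |= imax(arr[i+0], arr[i+4]) >= key;
        found |= imax(arr[i+1], arr[i+5]) >= key;
        found |= imax(arr[i+2], arr[i+6]) >= key;
        found |= imax(arr[i+3], arr[i+7]) >= key;
        if (found)
            break;
    }
    while (i < n && arr[i] < key)   /* scan within the hit block */
        ++i;
    return i;
}
```

Because the array is sorted, any earlier block has all elements below the key, so scanning from the hit block's start still yields the first qualifying index.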


Answer 15:

You could examine more than one element of the array at a time - platform specifically, this can be much faster or slower depending on how the larger data read is handled. For instance, on a 64-bit system, reading the array 2 elements at a time and checking the high/low elements separately could run faster due to less I/O. Still, this is an O(n) search no matter what.
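A little-endian sketch of the two-at-a-time idea (the `memcpy` load and the name `linear_pairs` are illustrative; on a big-endian machine the half extraction would have to be swapped):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load two adjacent 32-bit elements with one 64-bit read (memcpy
   keeps the access well-defined) and test the later element first:
   if even arr[i+1] is below the key, both elements can be skipped. */
static int linear_pairs(const int32_t *arr, int n, int32_t key)
{
    int i = 0;
    while (i + 2 <= n) {
        uint64_t pair;
        memcpy(&pair, &arr[i], sizeof pair);
        int32_t hi = (int32_t)(pair >> 32);   /* arr[i+1] on little-endian */
        if (hi >= key)
            break;                            /* answer is at i or i+1 */
        i += 2;
    }
    while (i < n && arr[i] < key)             /* finish within the pair */
        ++i;
    return i;
}
```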



Answer 16:

This answer is a little more obscure than my other one, so I'm posting it separately. It relies on the fact that C guarantees a boolean result of false=0 and true=1. x86 can produce booleans without branching, so it might be faster, but I haven't tested it. Micro-optimizations like these will always be highly dependent on your processor and compiler.

As before, the caller is responsible for putting a sentinel value at the end of the array to ensure that the loop terminates.

Determining the optimum amount of loop unrolling takes some experimentation. You want to find the point of diminishing (or negative) returns. I'm going to take a SWAG and try 8 this time.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
       }
       return i;
}

Edit: As Mark points out, this function introduces a dependency of each line on the line preceding it, which limits the ability of the processor pipeline to run the operations in parallel. So let's try a small modification to the function to remove the dependency. Now the function does indeed require 8 sentinel elements at the end.

static int 
linear (const int *arr, int n, int key) 
{ 
        assert(arr[n] >= key);
        assert(arr[n+7] >= key);
        int i = 0; 
        while (arr[i] < key) {
                int j = i;
                i += (arr[j] < key); 
                i += (arr[j+1] < key); 
                i += (arr[j+2] < key); 
                i += (arr[j+3] < key); 
                i += (arr[j+4] < key); 
                i += (arr[j+5] < key); 
                i += (arr[j+6] < key); 
                i += (arr[j+7] < key); 
       } 
       return i; 
} 


Answer 17:

In reality, the answer to this question is 100% dependent on the platform you're writing the code for. For example:

CPU : Memory speed | Example CPU | Type of optimisation
========================================================================
    Equal          |    8086     | (1) Loop unrolling
------------------------------------------------------------------------
  CPU > RAM        |  Pentium    | (2) None
  1. Avoiding the conditional branch required to loop through the data will give a slight performance improvement.
  2. Once the CPU starts to get faster than the RAM, it doesn't matter how efficient the loop becomes (unless it's a really bad loop), it will be stalling due to having to wait for the data to be brought in from RAM. SIMD doesn't really help since the advantage of parallel testing is still outweighed by having to wait for more data to arrive. SIMD really comes into its own when you're CPU limited.


Answer 18:

In one of the comments you said the array length is 64.

Well if you must do it linearly, you can do:

int i = -1;
do {
  if (arr[0] >= key){i = 0; break;}
  if (arr[1] >= key){i = 1; break;}
  ...
  if (arr[62] >= key){i = 62; break;}
  if (arr[63] >= key){i = 63; break;}
} while(0);

However, I seriously doubt that it is faster than this binary search: *

int i = 0;
if (key >= arr[i+32]) i += 32;
if (key >= arr[i+16]) i += 16;
if (key >= arr[i+ 8]) i +=  8;
if (key >= arr[i+ 4]) i +=  4;
if (key >= arr[i+ 2]) i +=  2;
if (key >= arr[i+ 1]) i +=  1;

*Thanks to Jon Bentley for that one.
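For the question's exact "first element >= key" semantics, the unrolled binary search needs a small wrapper (a sketch; `first_ge_64` is an illustrative name, and it assumes exactly 64 distinct elements):

```c
#include <assert.h>

/* The unrolled chain leaves i as the last index with arr[i] <= key
   (assuming arr[0] <= key); one guard and one final adjustment
   convert that into "first index with arr[i] >= key". */
static int first_ge_64(const int *arr, int key)
{
    if (key <= arr[0])
        return 0;
    int i = 0;
    if (key >= arr[i + 32]) i += 32;
    if (key >= arr[i + 16]) i += 16;
    if (key >= arr[i +  8]) i +=  8;
    if (key >= arr[i +  4]) i +=  4;
    if (key >= arr[i +  2]) i +=  2;
    if (key >= arr[i +  1]) i +=  1;
    return i + (arr[i] < key);
}
```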

Added: since you said this table is prepared once for a large number of searches, and you want speed, you could allocate some space somewhere and generate machine code with the values hard-wired into it. It could be either linear or binary search. If binary, the machine code would look like what the compiler would generate from this:

if (key < value32){
  if (key < value16){
    ...
  }
  else {
    ...
  }
}
else {
  if (key < value48){
    ...
  }
  else {
    ...
  }
}

Then you just copy that into a place where you can call it.

OR you could print the code above, compile and link it on the fly into a dll, and load the dll.



Answer 19:

uint32 LinearFindSse4( uint8* data, size_t data_len, uint8* finddata, size_t finddatalen )
{
    /**
     * the following is based on...
     * #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
     * we split it into 2 sections
     * first section is:
     * (v) - 0x01010101UL)
     *
     * second section is:
     * ~(v) & 0x80808080UL)
     */
    __m128i ones = _mm_set1_epi8( 0x01 );
    __m128i eights = _mm_set1_epi8( 0x80 );
    __m128i find_field = _mm_set1_epi8( finddata[0] );

    uint32 found_at = 0;
    for (size_t i = 0; i < data_len; i += 16)
    {
#define CHECKTHIS( n ) if (!memcmp(&data[i+n], &finddata[0], finddatalen)) { found_at = i + n; break; }

        __m128i chunk = _mm_stream_load_si128( (__m128i *)&data[i] );
        __m128i xor_result = _mm_xor_si128( chunk, find_field );
        __m128i first_sec = _mm_sub_epi64( xor_result, ones );
        __m128i second_sec = _mm_andnot_si128( xor_result, eights );

        if(!_mm_testz_si128(first_sec, second_sec))
        {
            CHECKTHIS(0);
            CHECKTHIS(1);
            CHECKTHIS(2);
            CHECKTHIS(3);
            CHECKTHIS(4);
            CHECKTHIS(5);
            CHECKTHIS(6);
            CHECKTHIS(7);
            CHECKTHIS(8);
            CHECKTHIS(9);
            CHECKTHIS(10);
            CHECKTHIS(11);
            CHECKTHIS(12);
            CHECKTHIS(13);
            CHECKTHIS(14);
            CHECKTHIS(15);
        }
    }
    return found_at;
}


Answer 20:

Well, you could use pointers...

static int linear(const int *array, int arraySize, int key) {
    int i;

    for(i = 0; i < arraySize; ++i) {
        if(*array >= key) {
            return i;
        }

        ++array;
    }

    return arraySize;
}