vectorized radix sort with numpy - can it beat np.

2020-06-18 03:53发布

问题:

Numpy doesn't yet have a radix sort, so I wondered whether it was possible to write one using pre-existing numpy functions. So far I have the following, which does work, but is about 10 times slower than numpy's quicksort.

Test and benchmark:

a = np.random.randint(0, 1e8, 1e6)
assert(np.all(radix_sort(a) == np.sort(a))) 
%timeit np.sort(a)
%timeit radix_sort(a)

The mask_b loop can be at least partially vectorized, broadcasting out across masks from &, and using cumsum with axis arg, but that ends up being a pessimization, presumably due to the increased memory footprint.

If anyone can see a way to improve on what I have I'd be interested to hear, even if it's still slower than np.sort...this is more a case of intellectual curiosity and interest in numpy tricks.

Note that you can implement a fast counting sort easily enough, though that's only relevant for small integer data.

Edit 1: Taking np.arange(n) out of the loop helps a little, but that's not very exiciting.

Edit 2: The cumsum was actually redundant (ooops!) but this simpler version only helps marginally with performance..

def radix_sort(a):
    bit_len = np.max(a).bit_length()
    n = len(a)
    cached_arange = arange(n)
    idx = np.empty(n, dtype=int) # fully overwritten each iteration
    for mask_b in xrange(bit_len):
        is_one = (a & 2**mask_b).astype(bool)
        n_ones = np.sum(is_one)      
        n_zeros = n-n_ones
        idx[~is_one] = cached_arange[:n_zeros]
        idx[is_one] = cached_arange[:n_ones] + n_zeros
        # next three lines just do: a[idx] = a, but correctly
        new_a = np.empty(n, dtype=a.dtype)
        new_a[idx] = a
        a = new_a
    return a

Edit 3: rather than loop over single bits, you can loop over two or more at a time, if you construct idx in multiple steps. Using 2 bits helps a little, I've not tried more:

idx[is_zero] = np.arange(n_zeros)
idx[is_one] = np.arange(n_ones)
idx[is_two] = np.arange(n_twos)
idx[is_three] = np.arange(n_threes)

Edits 4 and 5: going to 4 bits seems best for the input I'm testing. Also, you can get rid of the idx step entirely. Now only about 5 times, rather than 10 times, slower than np.sort (source available as gist):

Edit 6: This is a tidied up version of the above, but it's also a tiny bit slower. 80% of the time is spent on repeat and extract - if only there was a way to broadcast the extract :( ...

def radix_sort(a, batch_m_bits=3):
    bit_len = np.max(a).bit_length()
    batch_m = 2**batch_m_bits
    mask = 2**batch_m_bits - 1
    val_set = np.arange(batch_m, dtype=a.dtype)[:, nax] # nax = np.newaxis
    for _ in range((bit_len-1)//batch_m_bits + 1): # ceil-division
        a = np.extract((a & mask)[nax, :] == val_set,
                        np.repeat(a[nax, :], batch_m, axis=0))
        val_set <<= batch_m_bits
        mask <<= batch_m_bits
    return a

Edits 7 & 8: Actually, you can broadcast the extract using as_strided from numpy.lib.stride_tricks, but it doesn't seem to help much performance-wise:

Initially this made sense to me on the grounds that extract will be iterating over the whole array batch_m times, so the total number of cache lines requested by the CPU will be the same as before (it's just that by the end of the process it has request each cache line batch_m times). However the reality is that extract is not clever enough to iterate over arbitrary stepped arrays, and has to expand out the array before beginning, i.e. the repeat ends up being done anyway. In fact, having looked at the source for extract, I now see that the best we can do with this approach is:

a = a[np.flatnonzero((a & mask)[nax, :] == val_set) % len(a)]

which is marginally slower than extract. However, if len(a) is a power of two we can replace the expensive mod operation with & (len(a) - 1), which does end up being a bit faster than the extract version (now about 4.9x np.sort for a=randint(0, 1e8, 2**20). I suppose we could make this work for non-power of two lengths by zero-padding, and then cropping the extra zeros at the end of the sort...however this would be a pessimisation unless the length was already close to being power of two.

回答1:

I had a go with Numba to see how fast a radix sort could be. The key to good performance with Numba (often) is to write out all the loops, which is very instructive. I ended up with the following:

from numba import jit

@jit
def radix_loop(nbatches, batch_m_bits, bitsums, a, out):
    mask = (1 << batch_m_bits) - 1
    for shift in range(0, nbatches*batch_m_bits, batch_m_bits):
        # set bit sums to zero
        for i in range(bitsums.shape[0]):
            bitsums[i] = 0

        # determine bit sums
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            bitsums[j] += 1

        # take the cumsum of the bit sums
        cumsum = 0
        for i in range(bitsums.shape[0]):
            temp = bitsums[i]
            bitsums[i] = cumsum
            cumsum += temp

        # sorting loop
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            out[bitsums[j]] = a[i]
            bitsums[j] += 1

        # prepare next iteration
        mask <<= batch_m_bits
        # cant use `temp` here because of numba internal types
        temp2 = a
        a = out
        out = temp2

    return a

From the 4 inner loops, it's easy to see it's the 4th one making it hard to vectorize with Numpy.

One way to cheat around that problem is to pull in a particular C++ function from Scipy: scipy.sparse.coo.coo_tocsr. It does pretty much the same inner loops as the Python function above, so it can be abused to write a faster "vectorized" radix sort in Python. Maybe something like:

from scipy.sparse.coo import coo_tocsr

def radix_step(radix, keys, bitsums, a, w):
    coo_tocsr(radix, 1, a.size, keys, a, a, bitsums, w, w)
    return w, a

def scipysparse_radix_perbyte(a):
    # coo_tocsr internally works with system int and upcasts
    # anything else. We need to copy anyway to not mess with
    # original array. Also take into account endianness...
    a = a.astype('<i', copy=True)
    bitlen = int(a.max()).bit_length()
    radix = 256
    work = np.empty_like(a)
    _ = np.empty(radix+1, int)
    for i in range((bitlen-1)//8 + 1):
        keys = a.view('u1')[i::a.itemsize].astype(int)
        a, work = radix_step(radix, keys, _, a, work)
    return a

EDIT: Optimized the function a little bit.. see edit history.

One inefficiency of LSB radix sorting like above is that the array is completely shuffled in RAM a number of times, which means the CPU cache isn't used very well. To try to mitigate this effect, one could opt to first do a pass with MSB radix sort, to put items in roughly the right block of RAM, before sorting every resulting group with a LSB radix sort. Here's one implementation:

def scipysparse_radix_hybrid(a, bbits=8, gbits=8):
    """
    Parameters
    ----------
    a : Array of non-negative integers to be sorted.
    bbits : Number of bits in radix for LSB sorting.
    gbits : Number of bits in radix for MSB grouping.
    """
    a = a.copy()
    bitlen = int(a.max()).bit_length()
    work = np.empty_like(a)

    # Group values by single iteration of MSB radix sort:
    # Casting to np.int_ to get rid of python BigInt
    ngroups = np.int_(2**gbits)
    group_offset = np.empty(ngroups + 1, int)
    shift = max(bitlen-gbits, 0)
    a, work = radix_step(ngroups, a>>shift, group_offset, a, work)
    bitlen = shift
    if not bitlen:
        return a

    # LSB radix sort each group:
    agroups = np.split(a, group_offset[1:-1])
    # Mask off high bits to not undo the grouping..
    gmask = (1 << shift) - 1
    nbatch = (bitlen-1) // bbits + 1
    radix = np.int_(2**bbits)
    _ = np.empty(radix + 1, int)
    for agi in agroups:
        if not agi.size:
            continue
        mask = (radix - 1) & gmask
        wgi = work[:agi.size]
        for shift in range(0, nbatch*bbits, bbits):
            keys = (agi & mask) >> shift
            agi, wgi = radix_step(radix, keys, _, agi, wgi)
            mask = (mask << bbits) & gmask
        if nbatch % 2:
            # Copy result back in to `a`
            wgi[...] = agi
    return a

Timings (with best performing settings for each on my system):

def numba_radix(a, batch_m_bits=8):
    a = a.copy()
    bit_len = int(a.max()).bit_length()
    nbatches = (bit_len-1)//batch_m_bits +1
    work = np.zeros_like(a)
    bitsums = np.zeros(2**batch_m_bits + 1, int)
    srtd = radix_loop(nbatches, batch_m_bits, bitsums, a, work)
    return srtd

a = np.random.randint(0, 1e8, 1e6)
%timeit numba_radix(a, 9)
# 10 loops, best of 3: 76.1 ms per loop
%timeit np.sort(a)
#10 loops, best of 3: 115 ms per loop
%timeit scipysparse_radix_perbyte(a)
#10 loops, best of 3: 95.2 ms per loop
%timeit scipysparse_radix_hybrid(a, 11, 6)
#10 loops, best of 3: 75.4 ms per loop

Numba performs very well, as expected. And also with some clever application of existing C-extensions it's possible to beat numpy.sort. IMO at the level of optimization you've already gotten it's worth-it to also consider add-ons to Numpy, but I wouldn't really consider the implementations in my answer "vectorized": The bulk of the work is done in a external dedicated function.

One other thing that strikes me is the sensitivity to the choice of radix. For most of the settings I tried my implementations were still slower than numpy.sort, so in practice some sort of heuristic would be required to offer good performance across the board.



回答2:

Can you change this to be a counting / radix sort that works 8 bits at a time? For 32 bit unsigned integers, create a matrix[4][257] of counts of occurrence of byte fields, making one read pass over the array to be sorted. matrix[][0] = 0, matrix[][1] = # of occurences of 0, ... . Then convert the counts into indexes, where matrix[][0] = 0, matrix[][1] = # of bytes == 0, matrix[][2] = # of bytes == 0 + # of bytes == 1, ... . The last count is not used, since that would index the end of the array. Then do 4 passes of radix sort, moving data back and forth between the original array and the output array. Working 16 bits at time would need a matrix[2][65537], but only take 2 passes. Example C code:

size_t mIndex[4][257] = {0};            /* index matrix */
size_t i, j, m;
uint32_t u;
uint32_t *pData;                        /* ptr to original array */
uint32_t *pTemp;                        /* ptr to working array */
uint32_t *pSrc;                         /* working ptr */
uint32_t *pDst;                         /* working ptr */
/* n is size of array */
    for(i = 0; i < n; i++){             /* generate histograms */
        u = pData[i];
        for(j = 0; j < 4; j++){
            mIndex[j][1 + (size_t)(u & 0xff)]++; /* note [1 + ... */
            u >>= 8;
        }       
    }
    for(j = 0; j < 4; j++){             /* convert to indices */
        for(i = 1; i < 257; i++){       /* (last count never used) */
            mIndex[j][i] += mIndex[j][i-1]
        }       
    }
    pDst = pTemp;                       /* radix sort */
    pSrc = pData;
    for(j = 0; j < 4; j++){
        for(i = 0; i < count; i++){     /* sort pass */
            u = pSrc[i];
            m = (size_t)(u >> (j<<3)) & 0xff;
        /*  pDst[mIndex[j][m]++] = u;      split into 2 lines */
            pDst[mIndex[j][m]] = u;
            mIndex[j][m]++;
        }
        pTmp = pSrc;                    /* swap ptrs */
        pSrc = pDst;
        pDst = pTmp;
    }