C++ Fast and Efficient way to perform bit_count an

2019-07-31 10:28发布

问题:

In my project I need to AND two binary array of size 40 bytes(320 bits) and then compute set bit count in C++. I found some algorithms to do this but I want to know what is the fastest way of implementing it in c++. I mean what c++ data type would be proper?(unsinged char*,unsigned int 32,u_int64,...). I know many algorithms are compatible with 32bit integer although my array size is 40 bytes .

what about the algorithms described in this link: Fast Bit Counting Techniques which one is faster?

Is const type better or there is no difference?

Any help would be much appreciated.

回答1:

Here's a version that goes through the array with 4 bytes at once, requiring 10 iterations:

uint32_t *arr1_int = (uint32_t*) arr1;
uint32_t *arr2_int = (uint32_t*) arr2;
int i;
int bits_set = 0;

for (i = 0; i < 10; i++) {
    uint32_t v = arr1_int[i] & arr2_int[i];

    /* http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel */
    v = v - ((v >> 1) & 0x55555555);                   
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);    
    bits_set += ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

You can do this a lot faster with a modern CPU, using compiler intrinsics. For example on a 64 bit CPU with Visual C++:

#include <intrin.h>

__int64 *arr1_int = (__int64*) arr1;
__int64 *arr2_int = (__int64*) arr2;
int bits_set = 0;

/* 40 / 8 bytes == 5 iterations */
bits_set += __popcnt64(*arr1_int++ & *arr2_int++);
bits_set += __popcnt64(*arr1_int++ & *arr2_int++);
bits_set += __popcnt64(*arr1_int++ & *arr2_int++);
bits_set += __popcnt64(*arr1_int++ & *arr2_int++);
bits_set += __popcnt64(*arr1_int++ & *arr2_int++);

But this is all with performance in mind, if you simply want some readable code that works definitely go with what Rob suggested.



回答2:

I mean what c++ data type would be proper?

std::bitset<320>.

Any algorithm you come up with should be compared in speed and convenience to this one:

std::bitset<320> first;
std::bitset<320> other;

// twiddle bits here ...

std::bitset<320> and_result(first & other);
std::size_t number_of_bits(and_result.count());

If the alternatives don't go significantly faster, just use code like the above. It will clearly express your intent, and will avoid maintenance headaches later on.



回答3:

Something simple like this should be plenty fast enough:

const uint8_t LUT[256] = { 0, 1, 1, 2, ..., 8 }; // pop count LUT for bytes

int count_bits(const uint8_t *a1, const uint8_t *a2, int n)
{
    int count = 0;

    for (int i = 0; i < n; ++i)
    {
        count += LUT[a1[i] & a2[i]];
    }
    return count;
}

The are three loads and two ALU operations per byte, i.e. 120 loads and 80 ALU ops for your 40 byte use case.

Try it, profile it, and if it's not fast enough then you can look at more complex solutions that might be faster.