Is there a better (faster/more efficient) way to perform a bitwise operation on a large memory block than using a for loop? After looking it to options I noticed that std has a member std::bitset
, and was also wondering if it would be better (or even possible) to convert a large region of memory into a bitset without changing its values, then perform the operations, and then switch its type back to normal?
Edit / update: I think union
might apply here, such that the memory block is allocated a new
array of int
or something and then manipulated as a large bitset
. Operations seem to be able to be done over the entire set based on what is said here: http://www.cplusplus.com/reference/bitset/bitset/operators/ .
In general, there is no magical way faster than a for loop. However, you can make it easier for the compiler to optimize the loop by keeping a few things in mind:
- Load the largest available integer type into memory at a time. However, you need to be careful if your buffer has a length which does not divide evenly by the size of that integer type.
- If possible, operate on multiple values in one loop iteration - this should make vectorization much simpler for the compiler. Again, you need to be careful about the buffer length.
- If the loop is to be run many times on short sections of code, use a loop index that counts downwards to zero rather than upwards, and subtract it from the array length - this makes it easier for the CPU's branch predictor to figure out what's going on.
- You can use explicit vector extensions provided by the compiler, but this will make your code less portable.
- Ultimately, you can write the loop in assembly and use vector instructions provided by your CPU, but this is completely unportable.
- [edit] Additionally, you can use OpenMP or a similar API to divide the loop between multiple threads, but this will only cause an improvement if you are performing the operation on a very large amount of memory.
C99 example of xoring memory with a constant byte, assuming long long is 128-bit, the start of the buffer is aligned to 16 bytes, and without considering point 3. Bitwise operations on two memory buffers are very similar.
size_t len = ...;
char *buffer = ...;
size_t const loadd_per_i = 4
size_t iters = len / sizeof(long long) / loads_per_i;
long long *ptr = (long long *) buffer;
long long xorvalue = 0x5e5e5e5e5e5e5e5e5e5e5e5e5e5e5e5eLL;
// run in multiple threads if there are more than 4 MB to xor
#pragma omp parallel for if(iters > 65536)
for (size_t i = 0; i < iters; ++i) {
size_t j = loads_per_i*i;
ptr[j ] ^= xorvalue;
ptr[j+1] ^= xorvalue;
ptr[j+2] ^= xorvalue;
ptr[j+3] ^= xorvalue;
}
// finish long longs which don't align to 4
for (size_t i = iters * loads_per_i; i < len / sizeof(long long); ++i) {
ptr[i] ^= xorvalue;
}
// finish bytes which don't align to long
for (size_t i = (len / sizeof(long long)) * sizeof(long long); i < len; ++i) {
buffer[i] ^= xorvalue;
}