How to bitwise operate on memory block (C++)

Is there a better (faster/more efficient) way to perform a bitwise operation on a large memory block than using a for loop? After looking it to options I noticed that std has a member std::bitset, and was also wondering if it would be better (or even possible) to convert a large region of memory into a bitset without changing its values, then perform the operations, and then switch its type back to normal?

Edit / update: I think union might apply here, such that the memory block is allocated a new array of int or something and then manipulated as a large bitset. Operations seem to be able to be done over the entire set based on what is said here: http://www.cplusplus.com/reference/bitset/bitset/operators/ .

In general, there is no magical way faster than a for loop. However, you can make it easier for the compiler to optimize the loop by keeping a few things in mind:

Load the largest available integer type into memory at a time. However, you need to be careful if your buffer has a length which does not divide evenly by the size of that integer type.
If possible, operate on multiple values in one loop iteration - this should make vectorization much simpler for the compiler. Again, you need to be careful about the buffer length.
If the loop is to be run many times on short sections of code, use a loop index that counts downwards to zero rather than upwards, and subtract it from the array length - this makes it easier for the CPU's branch predictor to figure out what's going on.
You can use explicit vector extensions provided by the compiler, but this will make your code less portable.
Ultimately, you can write the loop in assembly and use vector instructions provided by your CPU, but this is completely unportable.
[edit] Additionally, you can use OpenMP or a similar API to divide the loop between multiple threads, but this will only cause an improvement if you are performing the operation on a very large amount of memory.

C99 example of xoring memory with a constant byte, assuming long long is 128-bit, the start of the buffer is aligned to 16 bytes, and without considering point 3. Bitwise operations on two memory buffers are very similar.

size_t len = ...;
char *buffer = ...;

size_t const loadd_per_i = 4
size_t iters = len / sizeof(long long) / loads_per_i;

long long *ptr = (long long *) buffer;
long long xorvalue = 0x5e5e5e5e5e5e5e5e5e5e5e5e5e5e5e5eLL;

// run in multiple threads if there are more than 4 MB to xor
#pragma omp parallel for if(iters > 65536)
for (size_t i = 0; i < iters; ++i) {
    size_t j = loads_per_i*i;
    ptr[j  ] ^= xorvalue;
    ptr[j+1] ^= xorvalue;
    ptr[j+2] ^= xorvalue;
    ptr[j+3] ^= xorvalue;
}

// finish long longs which don't align to 4
for (size_t i = iters * loads_per_i; i < len / sizeof(long long); ++i) {
    ptr[i] ^= xorvalue;
}

// finish bytes which don't align to long
for (size_t i = (len / sizeof(long long)) * sizeof(long long); i < len; ++i) {
    buffer[i] ^= xorvalue;
}