What is the best algorithm to achieve the following:
0010 0000 => 0000 0100
The conversion is from MSB->LSB to LSB->MSB. All bits must be reversed; that is, this is not endianness-swapping.
What is the best algorithm to achieve the following:
0010 0000 => 0000 0100
The conversion is from MSB->LSB to LSB->MSB. All bits must be reversed; that is, this is not endianness-swapping.
It seems that many other posts are concerned about speed (i.e best = fastest). What about simplicity? Consider:
and hope that clever compiler will optimise for you.
If you want to reverse a longer list of bits (containing
sizeof(char) * n
bits), you can use this function to get:This would reverse [10000000, 10101010] into [01010101, 00000001].
NOTE: All algorithms below are in C, but should be portable to your language of choice (just don't look at me when they're not as fast :)
Options
Low Memory (32-bit
int
, 32-bit machine)(from here):From the famous Bit Twiddling Hacks page:
Fastest (lookup table):
You can extend this idea to 64-bit
int
s, or trade off memory for speed (assuming your L1 Data Cache is large enough), and reverse 16 bits at a time with a 64K-entry lookup table.Others
Simple
Faster (32-bit processor)
Faster (64-bit processor)
If you want to do this on a 32-bit
int
, just reverse the bits in each byte, and reverse the order of the bytes. That is:Results
I benchmarked the two most promising solutions, the lookup table, and bitwise-AND (the first one). The test machine is a laptop w/ 4GB of DDR2-800 and a Core 2 Duo T7500 @ 2.4GHz, 4MB L2 Cache; YMMV. I used gcc 4.3.2 on 64-bit Linux. OpenMP (and the GCC bindings) were used for high-resolution timers.
reverse.c
reverse_lookup.c
I tried both approaches at several different optimizations, ran 3 trials at each level, and each trial reversed 100 million random
unsigned ints
. For the lookup table option, I tried both schemes (options 1 and 2) given on the bitwise hacks page. Results are shown below.Bitwise AND
Lookup Table (option 1)
Lookup Table (option 2)
Conclusion
Use the lookup table, with option 1 (byte addressing is unsurprisingly slow) if you're concerned about performance. If you need to squeeze every last byte of memory out of your system (and you might, if you care about the performance of bit reversal), the optimized versions of the bitwise-AND approach aren't too shabby either.
Caveat
Yes, I know the benchmark code is a complete hack. Suggestions on how to improve it are more than welcome. Things I know about:
ld
blew up with some crazy symbol redefinition error), so I don't believe the generated code is tuned for my microarchitecture.32-bit
EDIT: I also tried using
uint64_t
types on my machine to see if there was any performance boost. Performance was about 10% faster than 32-bit, and was nearly identical whether you were just using 64-bit types to reverse bits on two 32-bitint
types at a time, or whether you were actually reversing bits in half as many 64-bit values. The assembly code is shown below (for the former case, reversing bits for two 32-bitint
types at a time):I think the simplest method I know follows.
MSB
is input andLSB
is 'reversed' output:This ain't no job for a human! ... but perfect for a machine
This is 2015, 6 years from when this question was first asked. Compilers have since become our masters, and our job as humans is only to help them. So what's the best way to give our intentions to the machine?
Bit-reversal is so common that you have to wonder why the x86's ever growing ISA doesn't include an instruction to do it one go.
The reason: if you give your true concise intent to the compiler, bit reversal should only take ~20 CPU cycles. Let me show you how to craft reverse() and use it:
Compiling this sample program with Clang version >= 3.6, -O3, -march=native (tested with Haswell), gives artwork-quality code using the new AVX2 instructions, with a runtime of 11 seconds processing ~1 billion reverse()s. That's ~10 ns per reverse(), with .5 ns CPU cycle assuming 2 GHz puts us at the sweet 20 CPU cycles.
Caveat: this sample code should hold as a decent benchmark for a few years, but it will eventually start to show its age once compilers are smart enough to optimize main() to just printf the final result instead of really computing anything. But for now it works in showcasing reverse().
My simple solution