What is the best algorithm to achieve the following:
0010 0000 => 0000 0100
The conversion is from MSB->LSB to LSB->MSB. All bits must be reversed; that is, this is not endianness-swapping.
What is the best algorithm to achieve the following:
0010 0000 => 0000 0100
The conversion is from MSB->LSB to LSB->MSB. All bits must be reversed; that is, this is not endianness-swapping.
Presuming that you have an array of bits, how about this: 1. Starting from MSB, push bits into a stack one by one. 2. Pop bits from this stack into another array (or the same array if you want to save space), placing the first popped bit into MSB and going on to less significant bits from there.
Another loop-based solution that exits quickly when the number is low (in C++ for multiple types)
or in C for an unsigned int
How about the following:
Small and easy (though, 32 bit only).
This thread caught my attention since it deals with a simple problem that requires a lot of work (CPU cycles) even for a modern CPU. And one day I also stood there with the same ¤#%"#" problem. I had to flip millions of bytes. However I know all my target systems are modern Intel-based so let's start optimizing to the extreme!!!
So I used Matt J's lookup code as the base. the system I'm benchmarking on is a i7 haswell 4700eq.
Matt J's lookup bitflipping 400 000 000 bytes: Around 0.272 seconds.
I then went ahead and tried to see if Intel's ISPC compiler could vectorise the arithmetics in the reverse.c.
I'm not going to bore you with my findings here since I tried a lot to help the compiler find stuff, anyhow I ended up with performance of around 0.15 seconds to bitflip 400 000 000 bytes. It's a great reduction but for my application that's still way way too slow..
So people let me present the fastest Intel based bitflipper in the world. Clocked at:
Time to bitflip 400000000 bytes: 0.050082 seconds !!!!!
The printf's are for debugging..
Here is the workhorse:
The code takes 32 bytes then masks out the nibbles. The high nibble gets shifted right by 4. Then I use vpshufb and ymm4 / ymm3 as lookup tables. I could use a single lookup table but then I would have to shift left before ORing the nibbles together again.
There are even faster ways of flipping the bits. But I'm bound to single thread and CPU so this was the fastest I could achieve. Can you make a faster version?
Please make no comments about using the Intel C/C++ Compiler Intrinsic Equivalent commands...
Bit reversal in pseudo code
source -> byte to be reversed b00101100 destination -> reversed, also needs to be of unsigned type so sign bit is not propogated down
copy into temp so original is unaffected, also needs to be of unsigned type so that sign bit is not shifted in automaticaly
LOOP8: //do this 8 times test if bytecopy is < 0 (negative)