I'm not sure the exact term for what I'm trying to do. I have an 8x8
block of bits
stored in 8 bytes
, each byte stores one row. When I'm finished, I'd like each byte to store one column.
For example, when I'm finished:
Byte0out = Byte0inBit0 + Byte1inBit0 + Byte2inBit0 + Byte3inBit0 + ...
Byte1out = Byte0inBit1 + Byte1inBit1 + Byte2inBit1 + Byte3inBit1 + ...
What is the easiest way to do this in C which performs well?
This code is cribbed directly from "Hacker's Delight" - Figure 7-2 Transposing an 8x8-bit matrix, I take no credit for it:
I didn't check if this rotates in the direction you need, if not you might need to adjust the code.
Also, keep in mind datatypes & sizes -
int
&unsigned (int)
might not be 32 bits on your platform.BTW, I suspect the book (Hacker's Delight) is essential for the kind of work you're doing... check it out, lots of great stuff in there.
This sounds a lot like a so-called "Chunky to planar" routine used on displays that use bitplanes. The following link uses MC68K assembler for its code, but provides a nice overview of the problem (assuming I understood the question correctly):
http://membres.multimania.fr/amycoders/sources/c2ptut.html
If you are looking for the simplest solution:
If your are looking for the fastest solution:
How to transpose a bit matrix in the assembly by utilizing SSE2.
You really want to do something like this with SIMD instructions with something like the GCC vector vector support: http://ds9a.nl/gcc-simd/example.html
This is similar to the get column in a bitboard problem and can be solved efficiently by considering those input bytes as 8 bytes of a 64-bit integer. If bit 0 is the least significant one and byte 0 is the first byte in the array then I assume you want to do the following
with bXY is byte X's bit number Y. Masking out all the first 7 columns and read the array as an uint64_t we'll have
in little endian, with
abcdefgh
are b00 to b70 respectively. Now we just need to multiply that value with the magic number 0x2040810204081 to make a value withhgfedcba
in the MSB which is the flipped form in the resultIn reality you should read directly into an 8-byte array so that you don't need to combine the bytes later, but you need to align the array properly
In AVX2 Intel introduced the PDEP instruction (accessible via the
_pext_u64
intrinsic) in the BMI2 instruction set for this purpose so the function can be done in a single instructionMore ways to transpose the array can be found in the chess programming wiki
If you wanted an optimized solution you would use the SSE extensions in x86. You'd need to use 4 of these SIMD opcodes. MOVQ - move 8 bytes PSLLW - packed shift left logical words PMOVMSKB - packed move mask byte And 2 regular x86 opcodes LEA - load effective address MOV - move
25 x86 opcodes/instructions as opposed to the stacked for...loop solution with 64 iterations. Sorry the notation is not the ATT style syntax that c/c++ compilers accept.