I have some performance-heavy code that performs bit manipulations. It can be reduced to the following well-defined problem:
Given a 13-bit bitmap, construct a 26-bit bitmap that contains the original bits spaced at even positions.
To illustrate:
0000000000000000000abcdefghijklm (input, 32 bits)
0000000a0b0c0d0e0f0g0h0i0j0k0l0m (output, 32 bits)
I currently have it implemented in the following way in C:
if (input & (1 << 12))
output |= 1 << 24;
if (input & (1 << 11))
output |= 1 << 22;
if (input & (1 << 10))
output |= 1 << 20;
...
My compiler (MS Visual Studio) turned this into the following:
test eax,1000h
jne 0064F5EC
or edx,1000000h
... (repeated 13 times with minor differences in constants)
I wonder whether i can make it any faster. I would like to have my code written in C, but switching to assembly language is possible.
- Can i use some MMX/SSE instructions to process all bits at once?
- Maybe i can use multiplication? (multiply by 0x11111111 or some other magical constant)
- Would it be better to use condition-set instruction (SETcc) instead of conditional-jump instruction? If yes, how can i make the compiler produce such code for me?
- Any other idea how to make it faster?
- Any idea how to do the inverse bitmap transformation (i have to implement it too, bit it's less critical)?
You could do:
You haven't specified the platform this is to run on, and I'd like to try a different approach from the ones already posted (I like the lookup table one, which works fine until the number of bits is increased).
Most platforms have separate shift and rotate instructions. Almost always there is an instruction that includes the carry / overflow flags, so you can "shift in" a bit you want. Let's say we have these instructions: * SHIFTLEFT: does a leftshift and fills the lower bit with zero. * ROTATELEFT: does a leftshift, sets the lowest bit from the former value in carry flag, and sets the carry from the bit that got shifted "out" on left.
Pseudocode:
... repeat 13 times. Unrolling as you please.
The first shift should get the uppermost bit into place right before the carry. ROTATELEFT A will push the MSB into the carry, ROTATELEFT B will push the bit into the LSB of B, and SHIFTLEFT B will put the 0 in. Do that for all the bits.
Edit/Added:
You can do the opposite (inverse bitmap transformation) with the same instructions, like this:
LOAD value into register A; LOAD 0 into register B;
ROTATELEFT A; ROTATELEFT A; ROTATELEFT B; ... repeat 13 times and then SHIFTLEFT B; for (registerwidth-13) times.
LSB to carry; forget about it, next LSB into carry, put that into the target register, repeat for all bits, then align the result.
I'll give an algorithm that works without conditionals (only addition and bitwise operations), and I believe this will be faster than your current solution.
Here's the C code for 13 bits. Below there's an illustration of how the method works for 3 bits, and the generalization will be clear I hope.
(Note: The code is loop-unrolled. A good compiler will do that for you, so you can just condense it to a loop.)
Now, here's the explanation of the method for 3 bits. The initial state is '00abc'. Start by moving 'a' two places to the left by adding 01100 and then ANDing with 10011 (which happens to be the bitwise NOT of the previous number). This is how it works for a=0,1 (first arrow is the addition, second arrow is the AND):
a=0: 00abc = 000bc -> 011bc -> 000bc = a00bc
a=1: 00abc = 001bc -> 100bc -> 100bc = a00bc
Next, move 'b' one place to the left by adding 00010 and then ANDing with 10101:
b=0: a00bc = a000c -> a001c -> a000c = a0b0c
b=1: a00bc = a001c -> a010c -> a010c = a0b0c
That's it.
On Intel x86 processors starting from Haswell, you can use single
pdep
instruction fromBMI2
instruction set to do it:There is a clever way to do this which may be helpful here. It actually solves a slightly more general bit-shuffling problem. Your problem has an input of:
....but let's consider all of the bits:
and attempt to interleave them all like so:
For the first step, consider the middle half of the input:
Construct the 8-bit value: {
I^Q
,J^R
,K^S
,L^a
,M^b
,N^c
,O^d
,P^e
}.If we exclusive-OR this 8-bit value with bits [15:8], and also exclusive-OR the same 8-bit value with bits [23:16], we will swap the middle two bytes: for example, bit 23 (originally
I
) will becomeI ^ (I^Q) = Q
and bit 15 (originallyQ
) will becomeQ ^ (I^Q) = I
.To do that:
tmp = (input ^ (input >> 8)) & 0x0000ff00;
:Now the 8-bit value that we need is in bits [15:8], with all other bits 0. Now we can do the swap with
resulting in:
For the next step, divide and conquer... perform a similar swap of the middle bits of both the left hand half:
...and the right-hand half:
We can use exactly the same trick as in the first step, and because we want to perform exactly the same operation on both 16-bit halves of the 32-bit word, we can do them in parallel:
constructs the two pairs of 4 bits that we will use for the swap, and then
actually does the swap.
We can continue applying the same principle until the swap is complete. The bits that participate in the exchange at each point are marked with
#
:Code:
The reverse operation can be performed by running the 4 steps backwards:
although you may be able to improve on this for your particular application, if every other bit is known to be zero: see my answer to another question here.
As a final note, don't believe anything anyone says about relative performance of any of the methods suggested here without benchmarking them in your application. (In particular, large lookup tables can appear to be much better in simple microbenchmarks than they actually are in practice in a given real application, due to evicting large quantities of other data from the cache, which can have a negative effect on the outer loop(s).)
I think this might be relevant, but I'm not completely certain. I know of MMX instructions for interleaving bytes of 32/64 bit values, but not individual bits.