I have a 32-bit value in the lower part of a 64-bit register; the top part is 0 (X
denotes a bit with information, bits listed from LSB to MSB):
X X X ... X 0 0 0 0 ... 0
Now, I want to "space out" the bits with information, so that I have
X 0 X 0 X 0 ... X 0
(or if you'd rather put the 0's first, then
0 X 0 X 0 X 0 ... X
is fine too.)
What's a fast way to do that?
A multi-CPU-architecture-relevant answer would be nice, but something specific to Intel x86_64 and/or nVIDIA Pascal SM's would be the most relevant.
This is known as Morton number, which is a specific case of parallel expand which is in turn the reverse of compress right in the following questions
One general solution might be
However the constant generation might be inefficient on RISC architectures because the 64-bit immediate can't be stored in a single instruction like on x86. Even on x86 the output assembly is quite long. Here is another possible implementation as described on Bit Twiddling Hacks
A lookup table can also be used
The size of the lookup table may be increased if necessary
On x86 with BMI2 there's hardware support with PDEP instruction which can be accessed via the following intrinsic
Another solution on architectures without bit deposit/expand instruction but with fast multipliers
The way it works is like this
abcdefgh
to a0000000 b0000000 c0000000 d0000000 e0000000 f0000000 g0000000 h0000000 and store inexpand8bits
spacedOutBits
will contain a0b0c0d0 00000000 00000000 00000000 e0f0g0h0 00000000 00000000 00000000. We'll merge the two bytes in the result togetherThe magic number for bringing the bits closer is calculated like this
The output assembly can be seen here. You can change the compiler to see how it's done on various architectures
There's also an alternative way on the Bit Twiddling Hacks page
More solutions can be found in Portable efficient alternative to PDEP without using BMI2?
Related: How to do bit striping on pixel data?
As you can see, without the availability of a bit deposit instruction it'll be quite complex in terms of operations. If you do a not of bit striping like this then it'll be better to do in parallel using SIMD