OK, it may sound a bit complicated, but this is what I'm trying to do:
- Take e.g. 10101010101
- And return { 0, 2, 4, 6, 8, 10 }, i.e. an array with the positions of all the bits that are set.
This is my code:
    UINT DQBitboard::firstBit(U64 bitboard)
    {
        // Perfect-hash lookup table: maps the De Bruijn product of an
        // isolated bit back to that bit's index.
        static const int index64[64] = {
            63,  0, 58,  1, 59, 47, 53,  2,
            60, 39, 48, 27, 54, 33, 42,  3,
            61, 51, 37, 40, 49, 18, 28, 20,
            55, 30, 34, 11, 43, 14, 22,  4,
            62, 57, 46, 52, 38, 26, 32, 41,
            50, 36, 17, 19, 29, 10, 13, 21,
            56, 45, 25, 31, 35, 16,  9, 12,
            44, 24, 15,  8, 23,  7,  6,  5 };

        static const U64 debruijn64 = 0x07EDD5E59A4E28C2ULL;

        // (bitboard & -bitboard) isolates the lowest set bit; the multiply
        // and shift hash it into index64. The pragma silences MSVC's
        // "unary minus applied to unsigned" warning.
    #pragma warning (disable: 4146)
        return index64[((bitboard & -bitboard) * debruijn64) >> 58];
    }
    vector<UINT> DQBitboard::bits(U64 bitboard)
    {
        vector<UINT> res;
        while (bitboard)
        {
            UINT first = DQBitboard::firstBit(bitboard);
            res.push_back(first);
            bitboard &= ~(1ULL << first);   // clear the bit we just recorded
        }
        return res;
    }
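For instance, with the sample value from the top of the question (binary 10101010101 is 0x555):

    vector<UINT> v = DQBitboard::bits(0x555ULL);  // 10101010101 in binary
    // v now holds { 0, 2, 4, 6, 8, 10 }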
And the code surely does work. My point is:
- Is there any faster implementation you have in mind?
- Do you notice anything that could be optimized? If so, what?
Hints:
- UINT is a typedef of unsigned int
- U64 is a typedef of unsigned long long
- Both methods are static inline.
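In other words, the declarations in play are equivalent to:

    typedef unsigned int       UINT;
    typedef unsigned long long U64;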
I have been wondering whether it would work faster with the BSF assembly instruction. So I tried 3 implementations and got the following results for 10 million iterations:

Output:
- Your implementation (Dr.Kameleon): 1.77 seconds
- The log2() implementation (icepack): 2.17 seconds
- My assembly implementation (me): 0.16 seconds
One point about C/C++: static data declared inside a function is horrendous. It actually gets compiled into a list of CPU instructions (NOT what I would expect either!!!). Instead, use an array outside your function in a nameless namespace; that will have the expected effect. In assembly you can use .long (or some other size) directives and then %rip-relative addressing to reference the data.
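That is, something along these lines (a sketch; the table contents are the same as in the question):

    namespace {   // internal linkage, data emitted once as data, not instructions
        const int index64[64] = {
            63,  0, 58,  1, 59, 47, 53,  2,
            // ... same 64 entries as in the question ...
            44, 24, 15,  8, 23,  7,  6,  5 };
    }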
Note: once compiled, I do not see the size (n) being used in my assembly version, so I'm not too sure whether the returned array is valid. Outside of that, the code itself becomes a loop of 5 assembly instructions, hence the big increase in speed (about 10x).
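The exact assembly listing isn't reproduced in this excerpt, but a loop of that shape, using BSF through GCC inline assembly, might look roughly like this (the function name and output convention here are illustrative, not the answerer's exact code):

    static inline int bitsAsm(U64 bitboard, UINT *out)
    {
        int n = 0;
        while (bitboard) {
            U64 pos;
            __asm__ ("bsfq %1, %0" : "=r"(pos) : "r"(bitboard));  // index of lowest set bit
            out[n++] = (UINT)pos;
            bitboard &= bitboard - 1;   // clear that bit
        }
        return n;
    }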
The reason for log2()'s slowness is that it moves the number into an xmm register and then makes a call to another function; afterwards it converts the xmm register back to a regular register.
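icepack's log2() version is not shown here; a plausible reconstruction of that approach (my assumption, reusing the same lowest-bit isolation trick) would be:

    #include <cmath>

    static inline UINT firstBitLog2(U64 bitboard)
    {
        // ~b + 1 == -b, written this way to avoid warning C4146. The isolated
        // lowest bit is an exact power of two, so log2() gives the index exactly.
        return (UINT)std::log2((double)(bitboard & (~bitboard + 1)));
    }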
Compiled with g++ with this command line:
Although you may think that -msse4.2 could be the problem with the log2() version, I tried without it and log2() is still slower.
By the way, I do not recommend this method since it is not portable: only Intel-compatible (x86) processors will understand those instructions.
I tried a naive version, which clocked in around 2-3x faster, but reserve()'d the vector first. On applying the same reserve() to the original algorithm, it beat the naive one.
So I suspect that the vector operations are the bigger cost here, rather than the bit manipulation (or even the multiply used in the next-bit function).
There are some other speed-ups to be had, regarding finding the lowest set bit. My local log2 version was poor, and the function given in the original post is not super cheap.
Here's my best attempt:
Replaced the firstBit function with an asm intrinsic. Using the intrinsic gave a big boost here. (This is obviously not portable code, though I suspect a GCC variant shouldn't be too tricky).
Also assumes the vector is reasonably persistent, rather than being dynamically allocated/copied all the time, and just resizes it appropriately.
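The attempt itself isn't shown above, but a sketch matching that description (assuming MSVC's _BitScanForward64 intrinsic on x64 and a caller-owned vector that keeps its capacity between calls) might be:

    #include <intrin.h>
    #include <vector>

    void bitsFast(U64 bitboard, std::vector<UINT> &res)
    {
        res.clear();                      // capacity survives across calls
        unsigned long index;
        while (_BitScanForward64(&index, bitboard)) {   // returns false once bitboard == 0
            res.push_back((UINT)index);
            bitboard &= bitboard - 1;     // clear the lowest set bit
        }
    }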
Bit shifts are really cheap. Lookup tables cost cache space, and your lookup also involves an integer multiply. Just brute-forcing it will be faster than the clever techniques, I expect; see the sketch below.
You can unroll the loop a little bit, which may make it faster yet.
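A minimal sketch of that brute-force scan (my code, not the answerer's):

    #include <vector>

    std::vector<UINT> bitsBruteForce(U64 bitboard)
    {
        std::vector<UINT> res;
        res.reserve(64);                   // worst case: every bit set
        for (UINT i = 0; i < 64; ++i)      // test each bit with a shift
            if ((bitboard >> i) & 1)
                res.push_back(i);
        return res;
    }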
The std::vector is the most expensive part by far. Consider using the bitboard directly instead of materializing a container. Then you can write a simple loop over the set bits (sketched below), and I think you'll get your list of bits.
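For example, a minimal sketch of consuming the bitboard in place (forEachBit is an illustrative name, not from the original answer):

    template <typename Fn>
    void forEachBit(U64 bitboard, Fn fn)
    {
        while (bitboard) {
            fn(DQBitboard::firstBit(bitboard));  // visit the lowest set bit
            bitboard &= bitboard - 1;            // then clear it
        }
    }

    // Usage: forEachBit(bitboard, [](UINT pos) { /* use pos */ });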
Version 2:
Replace your firstBit function with an intrinsic using the BSF or BSR instruction for a massive speedup.
In gcc, that would be __builtin_ffsll and __builtin_ctzll; with Visual C++, _BitScanForward and _BitScanReverse.
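For instance, a GCC/Clang drop-in replacement (my sketch; note that __builtin_ctzll is undefined for 0, which is safe here because bits() only calls it while bitboard is nonzero):

    static inline UINT firstBitIntrinsic(U64 bitboard)
    {
        // Trailing-zero count == index of the lowest set bit (bitboard != 0).
        return (UINT)__builtin_ctzll(bitboard);
    }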
I actually think the fastest, simplest method is simply to loop around, but if we pass in a vector instead of returning a copy, it should be a little faster.
The multiply in the find-first function will cost a bit, so I doubt it's really worth it.