OK, let's consider a 64-bit number, with its bits forming an 8x8 table.
E.g.
0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0
written as
a b c d e f g h
----------------
0 1 1 0 1 0 1 0
0 1 1 0 1 0 1 1
0 1 1 1 1 0 1 0
0 1 1 0 1 0 1 0
1 1 1 0 1 0 1 0
0 1 1 0 1 0 1 0
0 1 1 0 1 1 1 0
0 1 1 0 1 0 1 0
Now, what if we want to isolate JUST, e.g., column d (00100000), or any row/diagonal for that matter?
Can this be done? And if so, how?
HINTS :
(a) My main objective here - though not initially mentioned - is raw speed. I'm searching for the fastest algorithm around, since the "retrieval" function is being performed some millions of times per second.
(b) This is what comes closer to what I mean : https://www.chessprogramming.org/Kindergarten_Bitboards
In your case (specialized for 8x8 = 64-bit tables), you need to use bit-shifting to extract the specific bits and then, again with bit-shifting, re-arrange them in a new integer variable:
See here: http://ideone.com/UlWAAO
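The linked snippet isn't reproduced here, but a plain shift-and-mask version of the idea might look like this (a sketch only; the function name is mine, and it assumes bit `row*8 + col` of the integer holds the cell at (row, col), with row 1 of the table as row 0):

```cpp
#include <cstdint>

// Extract column `col` (0 = 'a') by testing one bit per row and
// packing the results into the low 8 bits of the return value.
uint64_t get_col_shift(uint64_t board, int col) {
    uint64_t out = 0;
    for (int row = 0; row < 8; ++row) {
        out |= ((board >> (row * 8 + col)) & 1ULL) << row;
    }
    return out;
}
```

With this layout, `get_col_shift(board, 3)` returns column d packed into the low byte, with the table's top row in bit 0.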
How about this...
Here's a solution that can execute once per cycle (if the value and mask are in registers), if you are willing to use the intrinsic for the PEXT instruction on Intel (and if you are doing bitboard stuff, you likely are). That's for the 0th column; if you want another one, just shift the mask appropriately. Of course, this is cheating by using a hardware-specific instruction, but bitboards are all about cheating.
Here's a solution with only 4 main steps:
It works like this:
The magic number is chosen to copy only the needed bits and let the rest fall into unused places or overflow off the top of the number. The process looks like this (digits are bit "IDs" rather than the number itself):
If you add the const keywords, the assembly becomes quite nice actually: no branching, no external data, around 0.4 ns per calculation.
Edit: this takes about a sixth of the time of NPE's solution (the next fastest one), using it as a baseline.
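For reference, a sketch of this multiply-and-shift technique (the constants below follow the standard kindergarten-bitboard construction for a `row*8 + col` layout; the exact code and constants in the original answer may differ):

```cpp
#include <cstdint>

// 1) shift the column onto column 0, 2) mask one bit per row,
// 3) multiply so each row's bit lands in a distinct slot of the top
//    byte (the unwanted products overflow or stay below), 4) shift
//    the top byte down.
uint64_t get_col_magic(uint64_t board, int col) {
    const uint64_t column_mask = 0x0101010101010101ULL;  // one bit per row
    const uint64_t magic       = 0x0102040810204080ULL;  // gathers the rows
    uint64_t column = (board >> col) & column_mask;
    column *= magic;
    return column >> 56;
}
```

The multiply works because row r's bit (position 8r after masking) meets exactly one set bit of the magic that places it at position 56 + r; all other products land below bit 56 or overflow past bit 63 and are discarded.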
Right, so to "settle" the debate as to which is faster/slower/etc., I've put all the code into one program (and I hope I've credited the right person for the right code snippet).
The code can be found below, so you can inspect that I've interpreted it correctly when turning it into functions. I did run it with proper output and checked that each function gives the same result (bearing in mind that the bit order is slightly different in some cases, so I made a variation of my code that runs the other way, just to see that it gives the "right" result). So, without further ado, here are the results:
(viraptor's results from core i5, g++ 4.7)
(viraptor's results from core i5, clang++ 3.2)
That's clock cycles on a 3.4 GHz AMD Athlon II; I don't have a modern Intel machine, so if someone wishes to run the code on one, I'd be interested to see what it looks like. I'm fairly sure all of it runs well within the cache, perhaps aside from fetching some of the values in to check.
So, the winner is clearly viraptor, by about 40%; "my" code is second. Alex's code has no jumps/branches, but it still appears to run slower than the other alternatives. I'm not sure why npe's results are that much slower than mine; it does almost the same thing (and the code looks very similar in the assembler output from g++).
Code it up with straightforward loops and let the optimizer do the inlining and loop unrolling for you.
Compiled using g++ 4.7.2 with -O3, on my box the following can perform about 300 million get_col() calls per second.
bitboard.cpp:
bitboard_b.cpp:
If you look at the assembly code for get_col(), you'll see that it contains zero loops and is probably as efficient as anything you're likely to handcraft. This is not meant to be a complete implementation, just a rough illustration of the idea. In particular, the ordering of the bits may be the opposite of what you expect, etc.