I've read tons of web hits related to this issue, and I still haven't come across any definitive answer.
What I'd like to do is to make a database of chess positions, capable of identifying transpositions (generally which pieces are on which squares).
EDIT: it should also be capable to identify similar (but not exactly identical) positions.
This is a discussion almost 20 years ago (when space was an issue): https://groups.google.com/forum/#!topic/rec.games.chess.computer/wVyS3tftZAA
One of the discussants talk about encoding pieces on a square matrix, using 4 x 64 bits plus some bits more for the additional information (castling, en passant etc): there are six pieces (Pawn, Rook, Knight, Bishop, Queen, King) plus an empty square, that would be 3 bits (2^3), and one more bit for the color of the piece.
In total, there would be 4 numbers of 64bits each, plus some additional information.
Question: is there any other, more efficient way of storing a chess position?
I should probably mention this question is database centric, not game centric (i.e. my sole interest is to efficiently store and retrieve, not to create any AI or to generate any moves).
Thanks, Adrian
Take a loot at the Forsyth–Edwards Notation (FEN). It is described here. It is also well known and supported by many engines and chess programs.
Here is the FEN for the starting position:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Fen is seperated in 6 segments.
Segment 1 contains the pieces. black pieces are in lower case, white pieces are in upper case.
Segment 2 states, who's turn is it. (w or b)
Segment 3 is for castling. KQkq means both can castle on both sides.
K = King side white
q = queen side black
Segment 4 En passant target square in algebraic notation. If there's no en passant target square, this is "-". If a pawn has just made a two-square move, this is the position "behind" the pawn. This is recorded regardless of whether there is a pawn in position to make an en passant capture
Segment 5 Halfmove clock: This is the number of halfmoves since the last capture or pawn advance. This is used to determine if a draw can be claimed under the fifty-move rule.
Segment 6 Fullmove number: The number of the full move. It starts at 1, and is incremented after Black's move.
If you do not need a decodable position representation for comparisons then you could look at Zobrist hashing. This is used by chess engines to produce a 64 bit oneway hash of a position for spotting transpositions in search trees. As it is a oneway hash you obviously cannot reverse the position from the hash. The size of the hash is tunable, but 64 bits seems to be the accepted minimum size that results in few collisions. It would be ideal as a database index key with a fixed length of just 8 bytes. As collisions, though infrequent, are possible you could do a second pass comparing the actual positions to filter out any positions that have hashed to the same value if it is a concern. I use Zobrist hashes in one of my own applications (using SQLite) that I use to manage my openings and it has no trouble in finding transpositions.
Consider up to 32 pieces on the board. Each piece can be on one of the 64 squares. Representing the piece positions independendly in predetermined order requires 32*6=192 bits.
In addition to that each pawn can be promoted to a rook, bishop, knight or queen, so for each pawn we need to encode its state in 3 additional bits (4 possible piecetypes and normal pawn).
32*6+16*3 = 240 bit/ 30 byte
In many cases you will need additional information about the state of the game/variant the position arises in:
EnPassent File: 4 bits (8 files and none)
Castlerights: 4 bits (short/long white/black)
sideToMove : 1 bit (white/black)
which adds up to 249 bit/32 bytes.
This might not be the most compact representation, but it is easy to en-/decode.
You could use a modified run length encoding where each piece is encoded as a piece number (3 bits), with
0y111
used to skip ahead spaces. As there are many situations where pieces are next to each other, you end up omitting the positional information:The decoder starts off at a1 and proceed to the right, moving up at the end of a row, so the encoding for a starting board would be:
That being said, the additional complexity and uncertainty of a variable length encoding is probably not worth the space savings. Further, it may be worse than a constant width format in certain plays.
There are 32 pieces on the board, and 64 squares. Square index can be represented with a 6-bit number, so to represent the locations of each piece you need 32 six-bit numbers, or a total of 192 bits, which is less than 4x64.
You can do a bit better by realizing that not all positions are possible (e.g. a pawn cannot reach the end row of its own color) and using less than six bits for the position in these cases. Also, a position already occupied by another piece makes that position unavailable for other pieces.
As a piece may also be totally missing from the board, you should start with the kings' positions, as they are always there - and then, encoding another piece's position as the same of a king would mean that the piece has been taken.
Edit:
A short analysis of the pieces' possible positions:
Thus, this set of legal chess positions can be described trivially with an integer from zero up to (64^12 * 32^4 * 21^4 * 26^4 * 30^4 * 32^8)-1, or 391935874857773690005106949814449284944862535808450559999, which fits into 188 bits. Encoding and decoding a position to and from this is very straightforward - however, there are multiple numbers that decode into the same position (e.g. white knight 1 at B1 and white knight 2 at G1; and white knight 1 at G1 and white knight 2 at B1).
Due to the fact that no two pieces can occupy the same square, there is a tighter limit but it is a bit difficult to both encode and decode, so probably not useful in a real application. Also, the number shown above is pretty close to 2^188, so I don't think even this tighter encoding would fit into 187 bits.
My two cents
Short version
Longer version
32 bytes (4*64 bits) is quite small amount of data. 1000 million chess positions could fit in to 30 gigabytes. 192 bits is 24 bytes this would make in to 23 gigabytes. Probably database use some kind of compression and thus the in disk might be less than these figures. I don't know what kind of limits there are for storage, but because these seems quite tight encodings it might not be worth the effort try to minimize more.
Because ability to find similar positions was required I think the encoding should make it easy to compare different positions. Preferably this could be counted without decoding. For this to work the encoding should probably be constant length (can't think easy way to do this with variable length coding).
Indexing might speed up similarity search. Naive approach would be index all the positions by piece locations in database. This would make 32 indexes (and maybe for additional information also). It would make the search lightning fast at least in theory.
Indexes are going to take quite much space. Probably more than the actual positional data. And still they might not help that much. For example finding positions where black king is in, or near e4 required 9 searches using the index and then hopping around the 30 gigabytes of positional information which is likely need disk access in random locations. And probably finding similar positions is done for more than one piece...
If the storage format is efficient it might just be enough to brute force (like this)through all positional data and check the similarity position by position. This will use the CPU caches efficiently. Also because of the constant length record it is easy to divide the work to multiple processors or machines.
Whether to use piece-centric or board-based storage format depends on how you are going to calculate the similarity of two positions compared to each others. Piece-centric gives easy way to calculate distance of one piece in two different positions. However in piece-centric approach every piece is identified separately thus it is not so easy to find a pawn in certain location. One has to check every pawns location. If the piece identity is not so important, then board-based storage makes it easy to just check if a pawn is in wanted location. On the other hand it is not possible to check which exact pawn there is.