I want to find an as fast as possible way of multiplying two small boolean matrices, where small means, 8x8, 9x9 ... 16x16. This routine will be used a lot, so it needs to be very efficient, so please don't suggest that the straightforward solution should be fast enough.
For the special cases 8x8, and 16x16 I already have fairly efficient implementations, based on the solution found here, where we treat the entire matrix as an uint64_t
or uint64_t[4]
respectively. On my machine this is roughly 70-80 times faster than the straightforward implementation.
However, in the case of 8 < k < 16, I don't really know how I can leverage any reasonable representation in order to enable such clever tricks as above.
So basically, I'm open for any suggestions using any kind of representation (of the matrices) and function signature. You may assume that this targets either a 32-bit or 64-bit architecture (pick what best suits your suggestion)
There is a faster method for multiplying 8x8 matrices using 64-bit multiplication along with some simple bit trickery, which works for either GF[2] or boolean algebra. Assuming the three matrices being packed in 8 consecutive rows of 8 bits inside a 64-bit int each, we can use multiplication to scatter the bits and do the job in just one for loop:
The code for 16x16 is straightfoward if you can afford blocking the rows for improved efficiency. This trick is also used extensively in high-performance linear algebra libraries, and consists in partitioning the matrix into N/M x N/M blocks of MxM submatrices, with M = 2^m chosen to maximize locality in cache. The usual way to deal with N % M != 0 is to pad rows and columns with 0s so one can use the same algorithm for all block multiplications.
We can apply the same ideas to boolean matrices of variable dimension 8 >= N >= 16 as long as we can afford to have the matrices represented internally in a row blocking format. We just assume the matrix is 16x16 and the last 16-N rows and columns are filled with 0s:
Notice we have done a 16x16 matrix multiplication in just 8x8=64 integer products and some bit operations.
Also mul8x8 can be much improved with modern SSE/AVX vector instructions. In theory it is possible to perform all 8 products in parallel with one AVX512 instruction (we still need to scatter the data to the ZMM register first) and then reduce horizontally using lg2(8) = O(3) instructions.
Depending on your application, storing both the matrix and its transpose together might help. You will save a lot of time that otherwise would be used to transpose during matrix multiplications, at the expense of some memory and some more operations.
How about padding it out to the next "clever" (e.g. 8 or 16) size, with all '1' on the diagonal?
Given two 4x4 matrices a= 0010,0100,1111,0001, b=1100,0001,0100,0100, one could first calculate the transpose b' = 1000,1011,0000,0100.
Then the resulting matrix M(i,j)=a x b mod 2 == popcount(a[i]&b[j]) & 1; // or parity
From that one can notice that the complexity only grows in n^2, as long as the bitvector fits a computer word.
This can be speed up for 8x8 matrices at least, provided that some special permutation and bit selection operations are available. One can iterate exactly N times with NxN bits in a vector. (so 16x16 is pretty much the limit).
Each step consists of accumulating i.e. Result(n+1) = Result(n) XOR A(n) .& B(n), where Result(0) = 0, A(n) is A <<< n, and '<<<' == columnwise rotation of elements and where B(n) copies diagonal elements from the matrix B:
And after thinking it a bit further, a better option is to
^^^
(row wise rotate) matrix B and select A(n) == column copied diagonals from A:EDIT To benefit later readers, I'd propose the full solution for W<=16 bit matrix multiplications in portable C.