Fast multiplication of k x k boolean matrices, where 8 <= k <= 16

Asked 2020-07-07 10:44

I want to find a way of multiplying two small boolean matrices that is as fast as possible, where small means 8x8, 9x9, ..., 16x16. This routine will be used a lot, so it needs to be very efficient; please don't suggest that the straightforward solution should be fast enough.

For the special cases 8x8 and 16x16 I already have fairly efficient implementations, based on the solution found here, where we treat the entire matrix as a uint64_t or uint64_t[4] respectively. On my machine this is roughly 70-80 times faster than the straightforward implementation.

However, in the case of 8 < k < 16, I don't really know how I can leverage any reasonable representation in order to enable such clever tricks as above.

So basically, I'm open to any suggestions using any kind of representation (of the matrices) and function signature. You may assume that this targets either a 32-bit or a 64-bit architecture (pick whichever best suits your suggestion).

4 Answers
男人必须洒脱
#2 · 2020-07-07 11:10

There is a faster method for multiplying 8x8 matrices using 64-bit multiplication along with some simple bit trickery, which works for either GF(2) or boolean algebra. Assuming the three matrices (the two operands and the result) are packed as 8 consecutive rows of 8 bits each inside a 64-bit int, we can use multiplication to scatter the bits and do the job in just one for loop:

uint64_t mul8x8 (uint64_t A, uint64_t B) {

    const uint64_t ROW = 0x00000000000000FF;
    const uint64_t COL = 0x0101010101010101;

    uint64_t C = 0;

    for (int i=0; i<8; ++i) {
        uint64_t p = COL & (A>>i);   // column i of A, one bit in each byte
        uint64_t r = ROW & (B>>i*8); // row i of B, in the low byte
        C |= (p*r); // p*r broadcasts row i of B into every row whose bit in p is set; use ^ for GF(2) instead
    }
    return C;
}
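
For completeness, a minimal sketch of how one might pack a plain bool[8][8] into the row-major bit layout this function expects and call it (pack8x8 and get8x8 are hypothetical helper names, not part of the answer):

#include <stdint.h>
#include <stdbool.h>

// Pack a row-major bool[8][8] into a uint64_t: bit (i*8 + j) holds M[i][j].
static uint64_t pack8x8(bool M[8][8]) {
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            r |= (uint64_t)M[i][j] << (i*8 + j);
    return r;
}

// Extract entry (i,j) from the packed representation.
static bool get8x8(uint64_t M, int i, int j) {
    return (M >> (i*8 + j)) & 1;
}

// Usage: uint64_t C = mul8x8(pack8x8(A), pack8x8(B));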

The code for 16x16 is straightforward if you can afford blocking the rows for improved efficiency. This trick is also used extensively in high-performance linear algebra libraries, and consists of partitioning the matrix into N/M x N/M blocks of MxM submatrices, with M = 2^m chosen to maximize locality in cache. The usual way to deal with N % M != 0 is to pad rows and columns with 0s so one can use the same algorithm for all block multiplications.

We can apply the same ideas to boolean matrices of variable dimension 8 <= N <= 16, as long as we can afford to have the matrices represented internally in a row-blocking format. We just assume the matrix is 16x16 and the last 16-N rows and columns are filled with 0s:

void mul16x16 (uint64_t C[2][2], const uint64_t A[2][2], const uint64_t B[2][2]) {

    for (int i=0; i<2; ++i)
        for (int j=0; j<2; ++j)
            C[i][j] = mul8x8(A[i][0],B[0][j])
                    | mul8x8(A[i][1],B[1][j]); // once again, use ^ instead for GF(2)
}

Notice we have done a 16x16 matrix multiplication with just 8 calls to mul8x8, i.e. 8x8 = 64 integer products, plus some bit operations.
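
To make the padding idea concrete for 8 < k < 16, here is a sketch of how one might pack a k x k matrix into the uint64_t[2][2] block layout that mul16x16 expects (pack_blocks is a hypothetical helper, not part of the answer above):

#include <stdint.h>
#include <stdbool.h>

// Pack a k x k matrix (8 < k <= 16), given as k*k row-major bools, into four
// 8x8 blocks; rows and columns beyond k are implicitly zero.
static void pack_blocks(int k, const bool *M, uint64_t blk[2][2]) {
    blk[0][0] = blk[0][1] = blk[1][0] = blk[1][1] = 0;
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j)
            if (M[i*k + j])
                blk[i/8][j/8] |= (uint64_t)1 << ((i%8)*8 + (j%8));
}

// Usage:
//   uint64_t A[2][2], B[2][2], C[2][2];
//   pack_blocks(k, a, A); pack_blocks(k, b, B);
//   mul16x16(C, A, B);   // entry (i,j) of the product is bit
//                        // ((i%8)*8 + (j%8)) of C[i/8][j/8]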

Also, mul8x8 can be improved considerably with modern SSE/AVX vector instructions. In theory it is possible to perform all 8 products in parallel with one AVX-512 instruction (we still need to scatter the data into the ZMM registers first) and then reduce horizontally using log2(8) = 3 instructions.
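
For illustration, a sketch of how such a vectorized mul8x8 might look with AVX-512 intrinsics (assumes AVX512F and AVX512DQ; the scatter and reduction steps shown are one possible choice, not the only one):

#include <immintrin.h>
#include <stdint.h>

// Vectorized variant of mul8x8: all eight partial products in one 64-bit
// multiply, then a log2(8) = 3 step horizontal OR reduction.
static uint64_t mul8x8_avx512(uint64_t A, uint64_t B) {
    const __m512i COL = _mm512_set1_epi64(0x0101010101010101ULL);
    const __m512i ROW = _mm512_set1_epi64(0x00000000000000FFULL);
    const __m512i sa  = _mm512_set_epi64(7, 6, 5, 4, 3, 2, 1, 0);        // lane i: A >> i
    const __m512i sb  = _mm512_set_epi64(56, 48, 40, 32, 24, 16, 8, 0);  // lane i: B >> 8*i

    __m512i p = _mm512_and_si512(_mm512_srlv_epi64(_mm512_set1_epi64(A), sa), COL);
    __m512i r = _mm512_and_si512(_mm512_srlv_epi64(_mm512_set1_epi64(B), sb), ROW);
    __m512i c = _mm512_mullo_epi64(p, r);            // eight p*r products at once (AVX512DQ)

    // horizontal OR reduction in 3 steps (use XOR instead for GF(2))
    __m256i c4 = _mm256_or_si256(_mm512_castsi512_si256(c),
                                 _mm512_extracti64x4_epi64(c, 1));
    __m128i c2 = _mm_or_si128(_mm256_castsi256_si128(c4),
                              _mm256_extracti128_si256(c4, 1));
    __m128i c1 = _mm_or_si128(c2, _mm_unpackhi_epi64(c2, c2));
    return (uint64_t)_mm_cvtsi128_si64(c1);
}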

做个烂人
#3 · 2020-07-07 11:15

Depending on your application, storing both the matrix and its transpose together might help. You will save a lot of time that would otherwise be spent transposing during matrix multiplications, at the expense of some memory and a few extra operations whenever the matrix changes.
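
A minimal sketch of that idea for the packed 8x8 representation from the answer above (the bmat8 struct and the naive transpose loop are illustrative only; a bit-twiddling transpose would be faster):

#include <stdint.h>

// Keep each matrix together with its transpose so neither has to be
// recomputed inside hot multiply loops.
typedef struct {
    uint64_t m;   // bit (i*8 + j) = M[i][j]
    uint64_t mt;  // bit (i*8 + j) = M[j][i]
} bmat8;

static bmat8 bmat8_make(uint64_t m) {
    bmat8 r = { m, 0 };
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            r.mt |= ((m >> (i*8 + j)) & 1) << (j*8 + i);
    return r;
}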

狗以群分
#4 · 2020-07-07 11:28

How about padding it out to the next "clever" size (e.g. 8 or 16), with 1s on the diagonal of the padded part?
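
For example, with rows stored as uint16_t (column j in bit j, an assumed layout), the padding could look like the sketch below; the padded rows and columns form a separate identity block, so the top-left k x k part of the 16x16 product is exactly the k x k product:

#include <stdint.h>

// Pad a k x k bit matrix (k <= 16, column j in bit j of row i) to 16x16,
// putting 1s on the diagonal of the padded region.
static void pad_to_16(int k, const uint16_t *in, uint16_t out[16]) {
    for (int i = 0; i < k; ++i)  out[i] = in[i];
    for (int i = k; i < 16; ++i) out[i] = (uint16_t)(1u << i);
}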

等我变得足够好
#5 · 2020-07-07 11:32

Given two 4x4 matrices a = 0010,0100,1111,0001 and b = 1100,0001,0100,0100, one could first calculate the transpose b' = 1000,1011,0000,0100.

Then the resulting matrix is M(i,j) = (a x b mod 2)(i,j) == popcount(a[i] & b'[j]) & 1; // or parity

From that one can notice that the complexity grows only as n^2, as long as each row bit-vector fits in a computer word.
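
A sketch of this approach for up to 16x16, with each row stored in a uint16_t, column j in bit j (an assumed layout), and the transpose of b precomputed:

#include <stdint.h>

// c = a * b over GF(2); bt must hold the transpose of b, one row per uint16_t.
// __builtin_popcount is a GCC/Clang builtin; substitute a portable popcount if needed.
static void mul_gf2_popcount(int n, const uint16_t *a, const uint16_t *bt, uint16_t *c) {
    for (int i = 0; i < n; ++i) {
        uint16_t row = 0;
        for (int j = 0; j < n; ++j)
            row |= (uint16_t)((__builtin_popcount(a[i] & bt[j]) & 1) << j);
        c[i] = row;
    }
}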

This can be sped up, for 8x8 matrices at least, provided that some special permutation and bit-selection operations are available. One can iterate exactly N times with NxN bits in a vector (so 16x16 is pretty much the limit).

Each step accumulates the result, i.e. Result(n+1) = Result(n) XOR (A(n) .& B(n)), where Result(0) = 0, A(n) is A <<< n ('<<<' being a column-wise rotation of elements), and B(n) copies diagonal elements from the matrix B:

    a b c          a e i          d h c          g b f
B=  d e f  B(0) =  a e i  B(1) =  d h c   B(2) = g b f
    g h i          a e i          d h c          g b f

And after thinking about it a bit further, a better option is to ^^^ (rotate row-wise) matrix B and take A(n) as the (shifted) diagonal of A copied across each row:

    a b c         a a a           b b b           c c c 
A=  d e f  A(0) = e e e , A(1) =  f f f,  A(2) =  d d d 
    g h i         i i i           g g g           h h h 

EDIT To benefit later readers, here is a full solution for WxW (W <= 16) bit-matrix multiplication in portable C.

#include <stdint.h>

#define W 16   /* matrix dimension; the method targets W <= 16 */

void matrix_mul_gf2(const uint16_t *a, const uint16_t *b, uint16_t *c)
{
    // these arrays can be read in two successive xmm registers or in a single ymm
    uint16_t D[16];      // Temporary
    uint16_t C[16]={0};  // result
    uint16_t B[16];  
    uint16_t A[16];
    int i,j;
    uint16_t top_row;
    // Preprocess B (while reading from input) 
    // -- "un-tilt" the diagonal to bit position 0x8000
    for (i=0;i<W;i++) B[i]=(b[i]<<i) | (b[i]>>(W-i));
    for (i=0;i<W;i++) A[i]=a[i];  // Just read in matrix 'a'
    // Loop W times
    // Can be parallelized 4x with MMX, 8x with XMM and 16x with YMM instructions
    for (j=0;j<W;j++) {
        for (i=0;i<W;i++) D[i]=((int16_t)B[i])>>15;  // copy sign bit to rows
        for (i=0;i<W;i++) B[i]<<=1;                  // Prepare B for next round
        for (i=0;i<W;i++) C[i]^= A[i]&D[i];          // Add the partial product

        top_row=A[0];
        for (i=0;i<W-1;i++) A[i]=A[i+1];
        A[W-1]=top_row;
    }
    for (i=0;i<W;i++) c[i]=C[i];      // return result
}