Is there an algorithm to multiply square matrices in-place?

Posted 2019-02-06 15:29

The naive algorithm for multiplying 4x4 matrices looks like this:

void matrix_mul(double out[4][4], double lhs[4][4], double rhs[4][4]) {
    /* out = lhs * rhs; gives wrong results if out aliases lhs or rhs */
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out[i][j] = 0.0;
            for (int k = 0; k < 4; ++k) {
                out[i][j] += lhs[i][k] * rhs[k][j];
            }
        }
    }
}

Obviously, this algorithm gives bogus results if out == lhs or out == rhs (here == means reference equality). Is there a version that allows one or both of those cases that doesn't simply copy the matrix? I'm happy to have different functions for each case if necessary.

I found this paper, but it discusses the Strassen-Winograd algorithm, which is overkill for my small matrices. The answers to this question seem to indicate that if out == lhs && out == rhs (i.e., we're attempting to square the matrix), then it can't be done in place, but even there no convincing evidence or proof is given.

3 Answers
贪生不怕死
#2 · 2019-02-06 15:44

I'm not thrilled with this answer (I'm posting it mainly to silence the "it obviously can't be done" crowd), but I'm skeptical that it's possible to do much better with a true in-place algorithm (O(1) extra words of storage for multiplying two n x n matrices). Let's call the two matrices to be multiplied A and B. Assume that A and B are not aliased.

If A were upper-triangular, then the multiplication problem would look like this.

[a11 a12 a13 a14] [b11 b12 b13 b14]
[ 0  a22 a23 a24] [b21 b22 b23 b24]
[ 0   0  a33 a34] [b31 b32 b33 b34]
[ 0   0   0  a44] [b41 b42 b43 b44]

We can compute the product into B as follows. Multiply the first row of B by a11. Add a12 times the second row of B to the first. Add a13 times the third row of B to the first. Add a14 times the fourth row of B to the first.

Now, we've overwritten the first row of B with the correct product. Fortunately, we don't need it any more. Multiply the second row of B by a22. Add a23 times the third row of B to the second. (You get the idea.)
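
To make that concrete, here's a minimal sketch of this top-down sweep (the function name is mine), assuming A is upper-triangular and stored separately from B; it overwrites B with AB using O(1) extra storage:

void mul_upper_inplace(double a[4][4], double b[4][4]) {
    /* Row i of the product depends only on rows i..3 of the original B,
       so working top-down never destroys data we still need. */
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            b[i][j] *= a[i][i];               /* diagonal term */
            for (int k = i + 1; k < 4; ++k)
                b[i][j] += a[i][k] * b[k][j]; /* rows below, still intact */
        }
    }
}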

Likewise, if A were unit lower-triangular, then the multiplication problem would look like this.

[ 1   0   0   0 ] [b11 b12 b13 b14]
[a21  1   0   0 ] [b21 b22 b23 b24]
[a31 a32  1   0 ] [b31 b32 b33 b34]
[a41 a42 a43  1 ] [b41 b42 b43 b44]

Add a43 times the third row of B to the fourth. Add a42 times the second row of B to the fourth. Add a41 times the first row of B to the fourth. Add a32 times the second row of B to the third. (You get the idea.)
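
A companion sketch for the bottom-up sweep (again, the name is mine), assuming A is unit lower-triangular; row i of the product needs only rows 0..i-1 of the original B, plus row i itself for the implicit unit diagonal:

void mul_unit_lower_inplace(double a[4][4], double b[4][4]) {
    for (int i = 3; i >= 0; --i) {
        for (int j = 0; j < 4; ++j) {
            /* implicit 1 on the diagonal: b[i][j] already holds that term */
            for (int k = 0; k < i; ++k)
                b[i][j] += a[i][k] * b[k][j]; /* rows above, still intact */
        }
    }
}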

The complete algorithm is to LU-decompose A in place, multiply UB into B, multiply L(UB) into B, and then LU-undecompose A in place (I'm not sure anyone ever does this last step, but it seems easy enough to reverse the elimination). There are about a million reasons not to implement this in practice, two being that A may not be LU-decomposable and that, with floating-point arithmetic, A in general won't be reconstructed exactly.
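
Under those caveats, the whole pipeline might be sketched as below, where mul_upper_inplace and mul_unit_lower_inplace are the hypothetical helpers from the sketches above, and the decomposition is plain Doolittle elimination without pivoting (so it can break down on a zero pivot):

void mul_upper_inplace(double a[4][4], double b[4][4]);      /* sketched above */
void mul_unit_lower_inplace(double a[4][4], double b[4][4]); /* sketched above */

/* A -> L\U packed into A: the strict lower part holds L's multipliers
   (unit diagonal implicit); the upper part, including the diagonal, holds U. */
static void lu_decompose_inplace(double a[4][4]) {
    for (int k = 0; k < 3; ++k)
        for (int i = k + 1; i < 4; ++i) {
            a[i][k] /= a[k][k];                  /* multiplier l_ik */
            for (int j = k + 1; j < 4; ++j)
                a[i][j] -= a[i][k] * a[k][j];
        }
}

/* Run the elimination steps backwards to restore A (up to round-off). */
static void lu_undecompose_inplace(double a[4][4]) {
    for (int k = 2; k >= 0; --k)
        for (int i = k + 1; i < 4; ++i) {
            for (int j = k + 1; j < 4; ++j)
                a[i][j] += a[i][k] * a[k][j];
            a[i][k] *= a[k][k];
        }
}

void matrix_mul_inplace(double a[4][4], double b[4][4]) {
    lu_decompose_inplace(a);       /* A = LU, packed in place        */
    mul_upper_inplace(a, b);       /* B <- UB (reads A's upper part) */
    mul_unit_lower_inplace(a, b);  /* B <- L(UB) = AB                */
    lu_undecompose_inplace(a);     /* approximately restore A        */
}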

家丑人穷心不美
#3 · 2019-02-06 15:47

This answer is more sensible than my other one, though it uses one whole column of additional storage and has the same amount of data movement as the naive copying algorithm. To multiply A with B, storing the product in B (again assuming that A and B are stored separately):

For each column of B,
    Copy it into the auxiliary storage column
    Compute the product of A and the auxiliary storage column into that column of B

I switched the pseudocode to do the copy first because for large matrices, caching effects may result in it being more efficient to multiply A by the contiguous auxiliary column as opposed to the non-contiguous entries in B.
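
A sketch in C for the 4x4 case (the function name is mine):

void mul_into_rhs(double a[4][4], double b[4][4]) {
    double col[4]; /* the one auxiliary storage column */
    for (int j = 0; j < 4; ++j) {
        for (int i = 0; i < 4; ++i)
            col[i] = b[i][j];              /* copy column j of B out */
        for (int i = 0; i < 4; ++i) {      /* column j of B <- A * col */
            b[i][j] = 0.0;
            for (int k = 0; k < 4; ++k)
                b[i][j] += a[i][k] * col[k];
        }
    }
}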

小情绪 Triste *
#4 · 2019-02-06 16:07

This answer is about 4x4 matrices. Assuming, as you propose, that out may reference either lhs or rhs, and that A and B have cells of uniform bit length, then in order to perform the multiplication in place at all, the elements of A and B, viewed as signed integers, generally can be no larger in magnitude than floor(sqrt(2^(cellbitlength - 1) / 4)): each output cell is a sum of four products, so 4x^2 must still fit in a signed cell.

In this case, we can hack the elements of A into B (or vice versa) in the form of a bit shift, or via a combination of bit flags and modular arithmetic, and compute the product into the former matrix. If A and B were instead tightly packed, then, save for special cases or limits, we could not admit out to reference either lhs or rhs.
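
As an illustration of the bound and the bit-shift flavor of the packing (the helper names and the 32-bit cell size are my assumptions, not part of the answer): with cellbitlength = 32 the bound is floor(sqrt(2^31 / 4)) = 23170, which fits in a signed 16-bit half, so an element of A and an element of B can share one cell.

#include <stdint.h>

/* Pack A's element into the high 16 bits of a cell and B's into the low 16.
   |a_elem|, |b_elem| <= 23170 ensures both fit and that any 4-term dot
   product of unpacked values stays within a signed 32-bit cell. */
static int32_t pack(int16_t a_elem, int16_t b_elem) {
    return (int32_t)(((uint32_t)(uint16_t)a_elem << 16) | (uint16_t)b_elem);
}
static int16_t unpack_a(int32_t cell) {
    return (int16_t)(uint16_t)((uint32_t)cell >> 16);
}
static int16_t unpack_b(int32_t cell) {
    return (int16_t)(uint16_t)((uint32_t)cell & 0xFFFFu);
}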

Using the naive method now would not be unlike David's second algorithm description, just with the extra column stored in A or B itself. Alternatively, we could implement the Strassen-Winograd algorithm according to the schedule below, again with no storage outside of lhs and rhs. (The formulation of p0,...,p6 and C is taken from page 166 of Jonathan Golan's The Linear Algebra a Beginning Graduate Student Ought to Know.)

p0 = (a11 + a22)(b11 + b22), p1 = (a21 + a22)b11, p2 = a11(b12 - b22),
p3 = (a21 - a11)(b11 + b12), p4 = (a11 + a12)b22, p5 = a22(b21 - b11),
p6 = (a12 - a22)(b21 + b22)
    ┌                                      ┐
c = │ p0 + p5 - p4 + p6,      p2 + p4      │ 
    │   p1 + p5        , p0 - p1 + p2 + p3 │
    └                                      ┘
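
As a sanity check on these formulas, here is a minimal out-of-place 2x2 version in C (no packing or in-place tricks, just the seven products):

void strassen_2x2(double a[2][2], double b[2][2], double c[2][2]) {
    double p0 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double p1 = (a[1][0] + a[1][1]) * b[0][0];
    double p2 = a[0][0] * (b[0][1] - b[1][1]);
    double p3 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double p4 = (a[0][0] + a[0][1]) * b[1][1];
    double p5 = a[1][1] * (b[1][0] - b[0][0]);
    double p6 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    c[0][0] = p0 + p5 - p4 + p6;  /* = a11*b11 + a12*b21 */
    c[0][1] = p2 + p4;            /* = a11*b12 + a12*b22 */
    c[1][0] = p1 + p5;            /* = a21*b11 + a22*b21 */
    c[1][1] = p0 - p1 + p2 + p3;  /* = a21*b12 + a22*b22 */
}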

Schedule:

Each p below is a 2x2 quadrant; "x" means unassigned; "nc", no change. To compute each p, we use an unassigned 2x2 quadrant to superimpose the (one or two) results of 2x2 block matrix addition or subtraction, using the same bit-shift or modular method above; we then add their product (the seven multiplications, each yielding a single element) directly into the target block, in any order (note that for the 2x2-sized p2 and p4, we use the southwest quadrant of rhs, which is no longer needed at that point). For example, to write the first 2x2-sized p, p6, we superimpose the block matrix subtraction, rhs(a12) - rhs(a22), and the block matrix addition, rhs(b21) + rhs(b22), onto the lhs21 submatrix; we then add each of the seven single-element p's for that block multiplication, (a12 - a22) X (b21 + b22), directly into the lhs11 submatrix.

LHS            RHS (contains A and B)

(1)
p6   x
x    p3

(2)
+p0  x
p0   +p0

(3)
+p5  x
p5   nc

(4)
nc   p1
+p1  -p1

(5)
-p4  p4        p4 (B21)
nc   nc

(6)
nc  +p2        p2 (B21)
nc  +p2