The naive algorithm for multiplying 4x4 matrices looks like this:
void matrix_mul(double out[4][4], double lhs[4][4], double rhs[4][4]) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out[i][j] = 0.0;
            for (int k = 0; k < 4; ++k) {
                out[i][j] += lhs[i][k] * rhs[k][j];
            }
        }
    }
}
Obviously, this algorithm gives bogus results if out == lhs or out == rhs (here == means reference equality). Is there a version that allows one or both of those cases without simply copying the matrix? I'm happy to have different functions for each case if necessary.

I found this paper, but it discusses the Strassen-Winograd algorithm, which is overkill for my small matrices. The answers to this question seem to indicate that if out == lhs && out == rhs (i.e., we're attempting to square the matrix), then it can't be done in place, but even there, there's no convincing evidence or proof.
I'm not thrilled with this answer (I'm posting it mainly to silence the "it obviously can't be done" crowd), but I'm skeptical that it's possible to do much better with a true in-place algorithm (O(1) extra words of storage for multiplying two n x n matrices). Let's call the two matrices to be multiplied A and B. Assume that A and B are not aliased.
If A were upper-triangular, then the multiplication problem would look like this.
We can compute the product into B as follows. Multiply the first row of B by a11. Add a12 times the second row of B to the first. Add a13 times the third row of B to the first. Add a14 times the fourth row of B to the first. Now we've overwritten the first row of B with the correct product, and fortunately we don't need it any more. Multiply the second row of B by a22. Add a23 times the third row of B to the second. (You get the idea.)

Likewise, if A were unit lower-triangular, then the multiplication problem would look like this.
Add a43 times the third row of B to the fourth. Add a42 times the second row of B to the fourth. Add a41 times the first row of B to the fourth. Add a32 times the second row of B to the third. (You get the idea.)

The complete algorithm is to LU-decompose A in place, multiply UB into B, multiply LB into B, and then LU-undecompose A in place (I'm not sure if anyone ever does this, but it seems easy enough to reverse the steps). There are about a million reasons not to implement this in practice, two being that A may not be LU-decomposable and that A won't be reconstructed exactly in general with floating-point arithmetic.
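In C, the two triangular sweeps might look something like this (a sketch only; it assumes A already holds its LU factors in place, with U on and above the diagonal and a unit lower-triangular L strictly below it, and it omits the decompose/undecompose steps; computing B <- U*B and then B <- L*(U*B) gives A*B):

/* B <- U * B: sweep rows top to bottom, so each row is finished before it is read. */
void mul_upper_inplace(double a[4][4], double b[4][4]) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j)
            b[i][j] *= a[i][i];                 /* scale row i by a_ii */
        for (int k = i + 1; k < 4; ++k)
            for (int j = 0; j < 4; ++j)
                b[i][j] += a[i][k] * b[k][j];   /* add a_ik times row k to row i */
    }
}

/* B <- L * B (unit diagonal): sweep rows bottom to top for the same reason. */
void mul_unit_lower_inplace(double a[4][4], double b[4][4]) {
    for (int i = 3; i >= 0; --i)
        for (int k = 0; k < i; ++k)
            for (int j = 0; j < 4; ++j)
                b[i][j] += a[i][k] * b[k][j];   /* add a_ik times row k to row i */
}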
This answer is more sensible than my other one, though it uses one whole column of additional storage and has the same amount of data movement as the naive copying algorithm. To multiply A with B, storing the product in B (again assuming that A and B are stored separately): for each column of B, copy that column into the auxiliary storage, then multiply A by the auxiliary column, writing the result back into that column of B.
I switched the pseudocode to do the copy first because for large matrices, caching effects may result in it being more efficient to multiply A by the contiguous auxiliary column as opposed to the non-contiguous entries in B.
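In C, a minimal sketch of this might look like the following (aux is the single extra column):

/* B <- A * B using one auxiliary column, as described above (sketch). */
void matrix_mul_rhs_inplace(double a[4][4], double b[4][4]) {
    double aux[4];                         /* the one column of extra storage */
    for (int j = 0; j < 4; ++j) {
        for (int i = 0; i < 4; ++i)
            aux[i] = b[i][j];              /* copy column j of B first */
        for (int i = 0; i < 4; ++i) {
            double sum = 0.0;
            for (int k = 0; k < 4; ++k)
                sum += a[i][k] * aux[k];   /* row i of A times the saved column */
            b[i][j] = sum;
        }
    }
}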
This answer is about 4x4 matrices. Assuming, as you propose, that out may reference either lhs or rhs, and that A and B have cells of uniform bit-length, in order to technically be able to perform the multiplication in place, elements of A and B, as signed integers, generally cannot be greater or smaller than ± floor(sqrt(2 ^ (cellbitlength - 1) / 4)).

In this case, we can hack the elements of A into B (or vice versa) in the form of a bit shift or a combination of bit flags and modular arithmetic, and compute the product into the former matrix. If A and B were tightly packed, save for special cases or limits, we could not admit out to reference either lhs or rhs.
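As a toy sketch of the packing idea (simplified to unsigned 32-bit cells, nonnegative elements that fit in 16 bits, and dot products that fit in 32 bits; signed values would need the tighter bound above plus sign handling), the multiplication might look like this:

#include <stdint.h>

/* Stash lhs in the spare high bits of rhs, compute the product into lhs,
   then restore rhs. Only rhs is read during the product, so overwriting
   lhs is safe. */
void matrix_mul_packed(uint32_t lhs[4][4], uint32_t rhs[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            rhs[i][j] |= lhs[i][j] << 16;   /* pack lhs into rhs's high halves */

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            uint32_t sum = 0;
            for (int k = 0; k < 4; ++k)     /* lhs row from high halves, rhs column from low halves */
                sum += (rhs[i][k] >> 16) * (rhs[k][j] & 0xFFFFu);
            lhs[i][j] = sum;
        }

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            rhs[i][j] &= 0xFFFFu;           /* unpack: restore rhs */
}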
Using the naive method now would not be unlike David's second algorithm description, just with the extra column stored in A or B itself. Alternatively, we could implement the Strassen-Winograd algorithm according to the schedule below, again with no storage outside of lhs and rhs. (The formulation of p0,...,p6 and C is taken from page 166 of Jonathan Golan's The Linear Algebra a Beginning Graduate Student Ought to Know.)

Schedule:
Each p below is a 2x2 quadrant; "x" means unassigned; "nc", no change. To compute each p, we use an unassigned 2x2 quadrant to superimpose the (one or) two results of 2x2 block matrix addition or subtraction, using the same bit-shift or modular method above; we then add their product (the seven multiplications resulting in single elements) directly into the target block in any order (note that for the 2x2-sized p2 and p4, we use the southwest quadrant of rhs, which is no longer needed at that point). For example, to write the first 2x2-sized p6, we superimpose the block matrix subtraction, rhs(a12) - rhs(a22), and the block matrix addition, rhs(b21) + rhs(b22), onto the lhs21 submatrix; then add each of the seven single-element p's for that block multiplication, (a12 - a22) X (b21 + b22), directly to the lhs11 submatrix.
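For reference, the standard seven Strassen products and result quadrants, in a numbering chosen so that p6 matches the example above (it may differ from Golan's exact presentation):

p0 = (a11 + a22)(b11 + b22)
p1 = (a21 + a22) b11
p2 = a11 (b12 - b22)
p3 = a22 (b21 - b11)
p4 = (a11 + a12) b22
p5 = (a21 - a11)(b11 + b12)
p6 = (a12 - a22)(b21 + b22)

c11 = p0 + p3 - p4 + p6
c12 = p2 + p4
c21 = p1 + p3
c22 = p0 - p1 + p2 + p5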