I tried to implement the Strassen algorithm for matrix multiplication with C++, but the result isn't that, what I expected. As you can see strassen always takes more time then standard implementation and only with a dimension from a power of 2 is as fast as standard implementation. What went wrong?
matrix mult_strassen(matrix a, matrix b) {
if (a.dim() <= cut)
return mult_std(a, b);
matrix a11 = get_part(0, 0, a);
matrix a12 = get_part(0, 1, a);
matrix a21 = get_part(1, 0, a);
matrix a22 = get_part(1, 1, a);
matrix b11 = get_part(0, 0, b);
matrix b12 = get_part(0, 1, b);
matrix b21 = get_part(1, 0, b);
matrix b22 = get_part(1, 1, b);
matrix m1 = mult_strassen(a11 + a22, b11 + b22);
matrix m2 = mult_strassen(a21 + a22, b11);
matrix m3 = mult_strassen(a11, b12 - b22);
matrix m4 = mult_strassen(a22, b21 - b11);
matrix m5 = mult_strassen(a11 + a12, b22);
matrix m6 = mult_strassen(a21 - a11, b11 + b12);
matrix m7 = mult_strassen(a12 - a22, b21 + b22);
matrix c(a.dim(), false, true);
set_part(0, 0, &c, m1 + m4 - m5 + m7);
set_part(0, 1, &c, m3 + m5);
set_part(1, 0, &c, m2 + m4);
set_part(1, 1, &c, m1 - m2 + m3 + m6);
return c;
}
PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr
g++ main.cpp matrix.cpp -o matrix -O3
.
Ok I am no expert in this field, but there might be other issues at work here than processing speed. First the strassen method uses way more stack and has more function calls, which add memory movement. You have a certain penalty the larger your stack gets, since it needs to request larger frames from the OS. Plus you use dynamic allocation, this is also an issue.
Try to use a fixed size (with template parameter) matrix class? This will at least resolve the allocation issue.
Note: I am not sure that it event works properly with your code. Your matrix class uses pointers but has no copy constructor or assignment operator. You are also leaking memory at the end, since you don't have a destructor...
The big O of Strassen is O(N ^ log 7) compared to O(N ^ 3) regular, i.e. log 7 base 2 which is slightly less than 3.
That is the number of multiplications you need to make.
It assumes there is no cost to anything else you have, and also should be "faster" only as N gets large enough which yours probably does not.
Much of your implementation is creating lots of sub-matrices and my guess is the way you are storing them you are having to allocate memory and copy every time you do this. Having some kind of "slice" matrix and logical-transpose matrix if you can would help you optimise what is probably the slowest part of your process.
I am actually shocked at how much faster my Stassen multiplcation
implementation is:
http://ezekiel.vancouver.wsu.edu/~cs330/lectures/linear_algebra/mm/mm.c
I get an almost 16x speedup on my machine when n=1024.
The only way I can explain this much of a speed-up is that
my algorithm is more cache-friendly -- i.e., it focuses on small
pieces of the matrices and thus the data is more localized.
The overhead in your C++ implementation is probably too high -- the
compiler generates more temporaries than what is really necessary.
My implementation tries to minimizing this by reusing memory whenever
possible.
Long shot, but have you considered that the standard multiplication may be optimised by the compiler? Could you switch off optimisations?