Is there a way to improve the boost ublas product performance?
I have two matrices A and B which I want to multiply, add, subtract, etc.
In MATLAB vs. C++ I get the following times [s] for operations on 2000x2000 matrices:
OPERATION | MATLAB | C++ (MSVC10)
A + B | 0.04 | 0.04
A - B | 0.04 | 0.04
AB | 1.0 | 62.66
A'B' | 1.0 | 54.35
Why is there such a huge performance loss here?
The matrices contain only real doubles.
But I also need products of positive definite, symmetric, and rectangular matrices.
EDIT:
The code is trivial
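// assumes <boost/numeric/ublas/matrix.hpp> and using namespace boost::numeric::ublas
matrix<double> A( 2000 , 2000 );
// Fill Matrix A
matrix<double> B = A;
matrix<double> C = A + B;
matrix<double> D = A - B;
matrix<double> E = prod(A,B);
matrix<double> F = prod(trans(A),trans(B));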
EDIT 2:
The results are mean values of 10 tries. The stddev was less than 0.005.
I would expect a factor of 2-3, maybe, but not 50 (!)
EDIT 3:
Everything was benched in Release ( NDEBUG/MOVE_SEMANTICS/.. ) mode.
EDIT 4:
Preallocated Matrices for the product results did not affect the runtime.
Post your C++ code for advice on any possible optimizations.
You should be aware, however, that Matlab is highly specialized for its designed task, and you are unlikely to be able to match it using Boost. On the other hand, Boost is free, while Matlab is decidedly not.
I believe that the best Boost performance can be had by binding the uBlas code to an underlying LAPACK implementation.
You should use noalias on the left-hand side of matrix multiplications in order to get rid of unnecessary copies.
Instead of E = prod(A,B); use noalias(E) = prod(A,B);
From documentation:
If you know for sure that the left hand expression and the right hand
expression have no common storage, then assignment has no aliasing. A
more efficient assignment can be specified in this case: noalias(C) =
prod(A, B); This avoids the creation of a temporary matrix that is
required in a normal assignment. 'noalias' assignment requires that
the left and right hand side be size conformant.
There are many efficient BLAS implementations, like ATLAS, GotoBLAS, and MKL; use them instead.
I haven't looked at the code, but I guess ublas::prod(A, B) uses three nested loops with no blocking and is not cache friendly. If that's true, prod(A, trans(B)) will be much faster than the others.
If cblas is available, use cblas_dgemm to do the calculation. If not, you can simply rearrange the data, that is, store B transposed and compute prod(A, trans(B)) instead.
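The data rearrangement idea can be sketched without any library. This is only an illustration under assumptions: square n x n matrices stored row-major in a flat std::vector, and the function names multiply_naive, transpose and multiply_transposed are made up for this example. It is not how uBLAS implements prod.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in a flat vector: element (i, j) lives at i * n + j.
using Matrix = std::vector<double>;

// Textbook triple loop. The inner loop reads B down a column (stride n),
// which is the cache-unfriendly access pattern for row-major storage.
Matrix multiply_naive(const Matrix& A, const Matrix& B, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    Matrix C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    return C;
}

// Returns the transpose of a row-major n x n matrix.
Matrix transpose(const Matrix& M, std::size_t n) {
    assert(M.size() == n * n);
    Matrix T(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            T[j * n + i] = M[i * n + j];
    return T;
}

// Same product, but B is transposed once up front so that the inner loop
// reads both operands contiguously (row i of A and row j of B transposed).
Matrix multiply_transposed(const Matrix& A, const Matrix& B, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    const Matrix Bt = transpose(B, n);
    Matrix C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
    return C;
}
```

Both functions add the terms in the same order, so they return identical results; only the memory access pattern differs. How much that helps depends on matrix size and hardware, and a tuned BLAS will normally still be faster because it also blocks for cache and uses vector instructions.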
You don't know what role memory management is playing here. prod has to allocate a 32 MB matrix, and so does trans, twice, and then you're doing all that 10 times. Take a few stackshots and see what it's really doing. My dumb guess is that if you pre-allocate the matrices you will get a better result.
Other ways matrix multiplication could be sped up are
pre-transposing the left-hand matrix, to be cache-friendly, and
skipping over zeros. Only if A(i,k) and B(k,j) are both non-zero is any value contributed.
Whether this is done in uBlas is anybody's guess.
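The zero-skipping idea can be sketched as follows. This is a hedged illustration, not uBLAS code: it assumes square n x n matrices stored row-major in a flat std::vector, and multiply_skip_zeros is a hypothetical name. The loops are ordered i, k, j so that a zero A(i,k) lets the whole inner loop be skipped, and so that the inner loop walks B and C contiguously.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A * B for row-major n x n matrices, skipping zero entries of A.
std::vector<double> multiply_skip_zeros(const std::vector<double>& A,
                                        const std::vector<double>& B,
                                        std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            // A zero A(i,k) contributes nothing to row i of C, so skip it.
            // (This assumes B holds only finite values; 0 * inf would be NaN.)
            if (a == 0.0) continue;
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
    }
    return C;
}
```

The test on A(i,k) only pays off when the matrix has a meaningful share of zeros. For dense data like the matrices in the question, the branch is pure overhead and the loop ordering is the part that matters.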