Is there a way to improve the boost ublas product performance?
I have two matrices A,B which i want to mulitply/add/sub/...
In MATLAB vs. C++ i get the following times [s] for a 2000x2000 matrix Operations
OPERATION | MATLAB | C++ (MSVC10)
A + B | 0.04 | 0.04
A - B | 0.04 | 0.04
AB | 1.0 | 62.66
A'B' | 1.0 | 54.35
Why there is such a huge performance loss here?
The matrices are only real doubles. But i also need positive definites,symmetric,rectangular products.
EDIT: The code is trivial
matrix<double> A( 2000 , 2000 );
// Fill Matrix A
matrix<double> B = A;
C = A + B;
D = A - B;
E = prod(A,B);
F = prod(trans(A),trans(B));
EDIT 2: The results are mean values of 10 trys. The stddev was less than 0.005
I would expect an factor 2-3 maybe to but not 50 (!)
EDIT 3: Everything was benched in Release ( NDEBUG/MOVE_SEMANTICS/.. ) mode.
EDIT 4: Preallocated Matrices for the product results did not affect the runtime.
You don't know what role memory-management is playing here.
prod
is having to allocate a 32mb matrix, and so istrans
, twice, and then you're doing all that 10 times. Take a few stackhots and see what it's really doing. My dumb guess is if you pre-allocate the matrices you get a better result.Other ways matrix-multiplication could be speeded up are
pre-transposing the left-hand matrix, to be cache-friendly, and
skipping over zeros. Only if A(i,k) and B(k,j) are both non-zero is any value contributed.
Whether this is done in uBlas is anybody's guess.
You should use
noalias
in the left hand side of matrix multiplications in order to get rid of unnecessary copies.Instead of
E = prod(A,B);
usenoalias(E) = prod(A,b);
From documentation:
There are many efficient BLAS implementation, like ATLAS, gotoBLAS, MKL, use them instead.
I don't pick at the code, but guess the ublas::prod(A, B) using three-loops, no blocks and not cache friendly. If that's true, prod(A, B.trans()) will be much faster then others.
If cblas is avaiable, using cblas_dgemm to do the calculation. If not, you can simply rearrange the data, means, prod(A, B.trans()) instead.
Post your C+ code for advice on any possible optimizations.
You should be aware however that Matlab is highly specialized for its designed task, and you are unlikely to be able to match it using Boost. On the other hand - Boost is free, while Matlab decidedly not.
I believe that best Boost performance can be had by binding the uBlas code to an underlying LAPACK implementation.