BOOST uBLAS matrix product extremely slow

Posted 2019-02-17 22:26

Question:

Is there a way to improve the boost ublas product performance?

I have two matrices A and B which I want to multiply/add/subtract/...

In MATLAB vs. C++ I get the following times [s] for operations on 2000x2000 matrices:

OPERATION | MATLAB | C++ (MSVC10)
A + B     |  0.04  |  0.04
A - B     |  0.04  |  0.04
AB        |  1.0   | 62.66
A'B'      |  1.0   | 54.35

Why is there such a huge performance loss here?

The matrices contain only real doubles, but I also need products of positive definite, symmetric, and rectangular matrices.

EDIT: The code is trivial

#include <boost/numeric/ublas/matrix.hpp>
using namespace boost::numeric::ublas;

matrix<double> A( 2000, 2000 );
// Fill matrix A
matrix<double> B = A;

matrix<double> C = A + B;
matrix<double> D = A - B;
matrix<double> E = prod( A, B );
matrix<double> F = prod( trans(A), trans(B) );

EDIT 2: The results are mean values of 10 tries. The standard deviation was less than 0.005.

I would expect a factor of maybe 2-3, but not 50 (!)

EDIT 3: Everything was benchmarked in Release mode (NDEBUG / MOVE_SEMANTICS / ...).

EDIT 4: Preallocating the matrices for the product results did not affect the runtime.
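For reference, a minimal sketch of the kind of timing harness that could produce the numbers above (using std::chrono; the fill values are arbitrary placeholders, and a real benchmark would average over several runs as described in EDIT 2):

#include <boost/numeric/ublas/matrix.hpp>
#include <chrono>
#include <iostream>

using namespace boost::numeric::ublas;

int main()
{
    matrix<double> A( 2000, 2000 );
    for ( std::size_t i = 0; i < A.size1(); ++i )
        for ( std::size_t j = 0; j < A.size2(); ++j )
            A( i, j ) = 1.0 / ( i + j + 1 );   // arbitrary fill
    matrix<double> B = A;
    matrix<double> E( 2000, 2000 );

    auto t0 = std::chrono::steady_clock::now();
    E = prod( A, B );
    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>( t1 - t0 ).count() << " s\n";
}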

Answer 1:

Post your C++ code for advice on any possible optimizations.

You should be aware, however, that MATLAB is highly specialized for its designed task, and you are unlikely to be able to match it using Boost. On the other hand, Boost is free, while MATLAB decidedly is not.

I believe the best Boost performance can be had by binding the uBLAS code to an underlying LAPACK/BLAS implementation.



Answer 2:

You should use noalias on the left-hand side of matrix multiplications in order to get rid of unnecessary copies.

Instead of E = prod(A,B); use noalias(E) = prod(A,B);

From documentation:

If you know for sure that the left hand expression and the right hand expression have no common storage, then assignment has no aliasing. A more efficient assignment can be specified in this case: noalias(C) = prod(A, B); This avoids the creation of a temporary matrix that is required in a normal assignment. 'noalias' assignment requires that the left and right hand side be size conformant.
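Applied to the code from the question, this is a minimal sketch (it assumes E and F are already sized 2000x2000 and share no storage with A or B):

noalias(E) = prod( A, B );               // assigns straight into E, no temporary
noalias(F) = prod( trans(A), trans(B) ); // same for the transposed product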



Answer 3:

There are many efficient BLAS implementations, like ATLAS, gotoBLAS, and MKL; use one of them instead.

I haven't dug into the code, but I guess ublas::prod(A, B) uses a plain three-loop implementation, with no blocking, which is not cache friendly. If that's true, a product whose right-hand operand is stored transposed (so the inner loop walks contiguous memory) will be much faster.

If CBLAS is available, use cblas_dgemm to do the calculation. If not, you can simply rearrange the data yourself, i.e. keep a transposed copy of B and multiply against that layout instead.
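For illustration, a hedged sketch of going through cblas_dgemm for the product from the question. It assumes a dense, row-major ublas::matrix<double> with contiguous storage and that a CBLAS header (from ATLAS, gotoBLAS/OpenBLAS, or MKL) is available; blas_prod is just a made-up helper name:

#include <boost/numeric/ublas/matrix.hpp>
#include <cblas.h>

using namespace boost::numeric::ublas;

// E = A * B via BLAS; E must already be sized A.size1() x B.size2().
void blas_prod( const matrix<double> &A, const matrix<double> &B, matrix<double> &E )
{
    const int m = static_cast<int>( A.size1() );
    const int k = static_cast<int>( A.size2() );
    const int n = static_cast<int>( B.size2() );
    cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 m, n, k,
                 1.0, &A.data()[0], k,   // alpha, A, leading dimension of A
                      &B.data()[0], n,   // B, leading dimension of B
                 0.0, &E.data()[0], n ); // beta, E, leading dimension of E
}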



Answer 4:

You don't know what role memory management is playing here. prod has to allocate a 32 MB matrix, and so does trans, twice, and then you're doing all that 10 times. Take a few stackshots and see what it's really doing. My dumb guess is that if you pre-allocate the matrices you'll get a better result.

Other ways matrix multiplication could be sped up are:

  • pre-transposing one of the operands (for row-major storage, the right-hand matrix) so that the inner loop reads contiguous memory (a sketch follows below), and

  • skipping over zeros. Only if A(i,k) and B(k,j) are both non-zero is any value contributed.

Whether this is done in uBlas is anybody's guess.
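To illustrate the first bullet, here is a hedged sketch of a naive product with the right-hand operand copied into transposed form; this is not uBLAS's internal implementation, only a demonstration of the access pattern for row-major storage:

#include <boost/numeric/ublas/matrix.hpp>

using namespace boost::numeric::ublas;

// C = A * B, with B copied into Bt = trans(B) first so that the inner loop
// reads both A and Bt along contiguous rows (row-major storage assumed).
matrix<double> prod_transposed( const matrix<double> &A, const matrix<double> &B )
{
    const std::size_t m = A.size1(), k = A.size2(), n = B.size2();
    matrix<double> Bt = trans( B );          // one-off O(k*n) copy
    matrix<double> C( m, n );
    for ( std::size_t i = 0; i < m; ++i )
        for ( std::size_t j = 0; j < n; ++j )
        {
            double sum = 0.0;
            for ( std::size_t p = 0; p < k; ++p )
                sum += A( i, p ) * Bt( j, p );   // both operands walked row-wise
            C( i, j ) = sum;
        }
    return C;
}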