Note: I've posted this also on Eigen forum here
I want to premultiply 3xN matrices by a 3x3 matrix, i.e., to transform 3D points, like
p_dest = T * p_source
after initializing the matrices:
Eigen::Matrix<double, 3, Eigen::Dynamic> points = Eigen::Matrix<double, 3, Eigen::Dynamic>::Random(3, NUMCOLS);
Eigen::Matrix<double, 3, Eigen::Dynamic> dest = Eigen::Matrix<double, 3, Eigen::Dynamic>(3, NUMCOLS);
int NT = 100;
I have evaluated this two versions
// eigen direct multiplication
for (int i = 0; i < NT; i++){
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
dest.noalias() = T * points;
}
and
// col multiplication
for (int i = 0; i < NT; i++){
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
for (int c = 0; c < points.cols(); c++){
dest.col(c) = T * points.col(c);
}
}
the NT repetition are done just to compute average time
I am surprised the the column by column multiplication is about 4/5 time faster than the direct multiplication
(and the direct multiplication is even slower if I do not use the .noalias()
, but this is fine since it is doing a temporary copy)
I've tried to change NUMCOLS from 0 to 1000000 and the relation is linear.
I'm using Visual Studio 2013 and compiling in release
The next figure shows on X the number of columns of the matrix and in Y the avg time for a single operation, in blue the col by col multiplication, in red the matrix multiplication
Any suggestion why this happens?
Short answer
You're timing the lazy (and therefore lack of) evaluation in the col multiplication version, vs. the lazy (but evaluated) evaluation in the direct version.
Long answer
Instead of code snippets, let's look at a full MCVE. First, "you're" version:
void ColMult(Matrix3Xd& dest, Matrix3Xd& points)
{
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
for (int c = 0; c < points.cols(); c++){
dest.col(c) = T * points.col(c);
}
}
void EigenDirect(Matrix3Xd& dest, Matrix3Xd& points)
{
Eigen::Matrix3d T = Eigen::Matrix3d::Random();
dest.noalias() = T * points;
}
int main(int argc, char *argv[])
{
srand(time(NULL));
int NUMCOLS = 100000 + rand();
Matrix3Xd points = Matrix3Xd::Random(3, NUMCOLS);
Matrix3Xd dest = Matrix3Xd(3, NUMCOLS);
Matrix3Xd dest2 = Matrix3Xd(3, NUMCOLS);
int NT = 200;
// eigen direct multiplication
auto beg1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < NT; i++)
{
EigenDirect(dest, points);
}
auto end1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds = end1-beg1;
// col multiplication
auto beg2 = std::chrono::high_resolution_clock::now();
for(int i = 0; i < NT; i++)
{
ColMult(dest2, points);
}
auto end2 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds2 = end2-beg2;
std::cout << "Direct time: " << elapsed_seconds.count() << "\n";
std::cout << "Col time: " << elapsed_seconds2.count() << "\n";
std::cout << "Eigen speedup: " << elapsed_seconds2.count() / elapsed_seconds.count() << "\n\n";
return 0;
}
With this code (and SSE turned on), I get:
Direct time: 0.449301
Col time: 0.10107
Eigen speedup: 0.224949
Same 4-5 slowdown you complained of. Why?!?! Before we get to the answer, let's modify the code a bit so that the dest
matrix is sent to an ostream
. Add std::ostream outPut(0);
to the beginning of main()
and before ending the timers add outPut << dest << "\n\n";
and outPut << dest2 << "\n\n";
. The std::ostream outPut(0);
doesn't output anything (I'm pretty sure the badbit is set), but it does cause Eigens operator<<
to be called, which forces the evaluation of the matrix.
NOTE: if we used outPut << dest(1,1)
then dest
would be evaluated only enough to output the single element in the col multiplication method.
We then get
Direct time: 0.447298
Col time: 0.681456
Eigen speedup: 1.52349
as a result as expected. Note that the Eigen direct method took the exact(ish) same time (meaning the evaluation took place even without the added ostream
), whereas the col method all of the sudden took much longer.