Eigen3 matrix multiplication performance

Note: I've posted this also on Eigen forum here

I want to premultiply 3xN matrices by a 3x3 matrix, i.e., to transform 3D points, like p_dest = T * p_source

after initializing the matrices:

Eigen::Matrix<double, 3, Eigen::Dynamic> points = Eigen::Matrix<double, 3, Eigen::Dynamic>::Random(3, NUMCOLS);
Eigen::Matrix<double, 3, Eigen::Dynamic> dest = Eigen::Matrix<double, 3, Eigen::Dynamic>(3, NUMCOLS);
int NT = 100;

I have evaluated this two versions

// eigen direct multiplication
for (int i = 0; i < NT; i++){
  Eigen::Matrix3d T = Eigen::Matrix3d::Random();
  dest.noalias() = T * points;
}

and

// col multiplication
for (int i = 0; i < NT; i++){
  Eigen::Matrix3d T = Eigen::Matrix3d::Random();
  for (int c = 0; c < points.cols(); c++){
    dest.col(c) = T * points.col(c);
  }
}

the NT repetition are done just to compute average time

I am surprised the the column by column multiplication is about 4/5 time faster than the direct multiplication (and the direct multiplication is even slower if I do not use the .noalias(), but this is fine since it is doing a temporary copy) I've tried to change NUMCOLS from 0 to 1000000 and the relation is linear.

I'm using Visual Studio 2013 and compiling in release

The next figure shows on X the number of columns of the matrix and in Y the avg time for a single operation, in blue the col by col multiplication, in red the matrix multiplication

Any suggestion why this happens?

Short answer

You're timing the lazy (and therefore lack of) evaluation in the col multiplication version, vs. the lazy (but evaluated) evaluation in the direct version.

Long answer

Instead of code snippets, let's look at a full MCVE. First, "you're" version:

void ColMult(Matrix3Xd& dest, Matrix3Xd& points)
{
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    for (int c = 0; c < points.cols(); c++){
        dest.col(c) = T * points.col(c);
    }
}

void EigenDirect(Matrix3Xd& dest, Matrix3Xd& points)
{
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    dest.noalias() = T * points;
}

int main(int argc, char *argv[])
{
    srand(time(NULL));

    int NUMCOLS = 100000 + rand();

    Matrix3Xd points = Matrix3Xd::Random(3, NUMCOLS);
    Matrix3Xd dest   = Matrix3Xd(3, NUMCOLS);
    Matrix3Xd dest2  = Matrix3Xd(3, NUMCOLS);
    int NT = 200;
    // eigen direct multiplication
    auto beg1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < NT; i++)
    {
        EigenDirect(dest, points);
    }
    auto end1 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed_seconds = end1-beg1;

    // col multiplication
    auto beg2 = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < NT; i++)
    {
        ColMult(dest2, points);
    }

    auto end2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed_seconds2 = end2-beg2;
    std::cout << "Direct time: " << elapsed_seconds.count() << "\n";
    std::cout << "Col time: " << elapsed_seconds2.count() << "\n";

    std::cout << "Eigen speedup: " << elapsed_seconds2.count() / elapsed_seconds.count() << "\n\n";
    return 0;
}

With this code (and SSE turned on), I get:

Direct time: 0.449301
Col time: 0.10107
Eigen speedup: 0.224949

Same 4-5 slowdown you complained of. Why?!?! Before we get to the answer, let's modify the code a bit so that the dest matrix is sent to an ostream. Add std::ostream outPut(0); to the beginning of main() and before ending the timers add outPut << dest << "\n\n"; and outPut << dest2 << "\n\n";. The std::ostream outPut(0); doesn't output anything (I'm pretty sure the badbit is set), but it does cause Eigens operator<< to be called, which forces the evaluation of the matrix.

NOTE: if we used outPut << dest(1,1) then dest would be evaluated only enough to output the single element in the col multiplication method.

We then get

Direct time: 0.447298
Col time: 0.681456
Eigen speedup: 1.52349

as a result as expected. Note that the Eigen direct method took the exact(ish) same time (meaning the evaluation took place even without the added ostream), whereas the col method all of the sudden took much longer.