I'm writing some functions that compute things like the "separated sum" of negative and positive numbers, Kahan summation, pairwise summation, and other reductions where the order in which I take the elements from the matrix doesn't matter. For example:
template <typename T, int R, int C>
inline T sum(const Eigen::Matrix<T,R,C>& xs)
{
    T sumP(0);
    T sumN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
        for (size_t j = 0; j < nCols; ++j)
        {
            if (xs(i,j) > 0)
                sumP += xs(i,j);
            else if (xs(i,j) < 0) // ignore 0 elements: an improvement for sparse matrices, I think
                sumN += xs(i,j);
        }
    return sumP + sumN;
}
Now, I would like to make this as efficient as possible, so my question is: would it be better to loop through each column of each row as above, or to do the opposite, as in the following:
for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nCols; ++i)
    for (size_t j = 0; j < nRows; ++j)
(I suppose this depends on the order in which the matrix elements are laid out in memory, but I couldn't find that in Eigen's manual.)
Also, are there alternative approaches, such as using iterators (do they exist in Eigen?), that might be slightly faster?
I notice that your code is equivalent to summing all the entries of the matrix, i.e., you could just do this:
I would assume that would perform better since it's only a single pass, and furthermore Eigen ought to "know" how to arrange the passes for optimum performance.
If, however, you want to retain the two passes, you could express this using the coefficient-wise reduction mechanisms, like this:
which uses the boolean coefficient-wise select to pick the positive and negative entries. I don't know whether it would outperform the hand-rolled loops, but in theory coding it this way allows Eigen (and the compiler) to know more about your intention, and possibly improve the results.
Eigen stores matrices in column-major (Fortran) order by default (see the documentation).
The fastest way to iterate over a matrix is in storage order; doing it the wrong way round increases the number of cache misses by a factor of cacheline/elemsize (probably 64/8 = 8). If your matrix doesn't fit in L1 cache, those misses will dominate your computation time, so iterating in the wrong order increases it by roughly that factor.
If your matrix fits into L1 cache this won't make a difference, but a good compiler should be able to vectorise the loop, which with AVX enabled (on a shiny new Core i7) could give you a speedup of as much as four times (256 bits / 64 bits).
Finally, don't expect any of Eigen's built-in functions to give you a speed-up (I don't think there are iterators anyhow, but I may be mistaken); they're just going to generate the same (very simple) code.
TL;DR: swap your order of iteration; you need to vary the row index most quickly.
Try storing xs(i,j) in a temporary variable inside the loop so you only call the function once per element.
I did some benchmarks to check which way is quicker, and I got the following results (in seconds):
The first line does the iteration as suggested by @jleahy. The second line does the iteration as in the code from my question (the inverse order of @jleahy). The third line iterates over the raw storage using PlainObjectBase::data(), like this: for (int i = 0; i < matrixObject.size(); i++). The other three lines repeat the same as the above, but with a temporary, as suggested by @lucas92.
I also ran the same tests but replaced the else if (xs(i,j)<0) with a plain else (no special treatment for sparse matrices), and I got the following (in seconds):
Running the tests again gave me pretty similar results. I used g++ 4.7.3 with -O3. The code: