I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calculations. Consider the following example:
testmat <- matrix(1:9, nrow=3)
testvec <- 1:3
testmat*testvec
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    4   10   16
#[3,]    9   18   27
Here, R recycled testvec so that, loosely speaking, testvec "became" a matrix of the same dimensions as testmat for the purpose of this multiplication, and the Hadamard product is returned. I wish to implement this behavior using Rcpp; that is, I want each element of the i-th row of the matrix testmat to be multiplied by the i-th element of the vector testvec. My benchmarks tell me that my implementations are extremely slow, and I would appreciate advice on how to speed this up. Here is my code:
First, using Eigen:
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]

using namespace Rcpp;
using namespace Eigen;

// [[Rcpp::export]]
NumericMatrix E_matvecprod_elwise(NumericMatrix Xs, NumericVector ys){
  Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
  Map<VectorXd> y(as<Map<VectorXd> >(ys));
  int k = X.cols();
  int n = X.rows();
  MatrixXd Y(n, k);
  // here, I emulate R's recycling. I did not find an easier way of doing this. Any hint appreciated.
  for(int i = 0; i < k; ++i) {
    Y.col(i) = y;
  }
  MatrixXd out = X.cwiseProduct(Y);
  return wrap(out);
}
Here is my implementation using Armadillo (adjusted to follow Dirk's example, see his answer below):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
  int k = X.n_cols;
  arma::mat Y = repmat(y, 1, k);
  arma::mat out = X % Y;
  return out;
}
Benchmarking the R, Eigen, and Armadillo solutions shows that both the Eigen and the Armadillo versions are about 2 times slower than R. Is there a way to speed these computations up, or at least to get as fast as R? Are there more elegant ways of setting this up? Any advice is appreciated and welcome. (I also encourage tangential remarks about programming style in general, as I am new to Rcpp / C++.)
Here are some reproducible benchmarks:
# benchmark() comes from the rbenchmark package
library(rbenchmark)

# for comparison, define R function:
R_matvecprod_elwise <- function(mat, vec) mat*vec

n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)

benchmark(R_matvecprod_elwise(X, e), A_matvecprod_elwise(X, e), E_matvecprod_elwise(X, e),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative", replications = 1000)
This yields
                        test replications elapsed relative
1  R_matvecprod_elwise(X, e)         1000   10.89    1.000
2  A_matvecprod_elwise(X, e)         1000   26.87    2.467
3  E_matvecprod_elwise(X, e)         1000   27.73    2.546
As you can see, my Rcpp solutions perform quite miserably. Any way to do it better?
For starters, I'd write the Armadillo version with the plain interface shown in the sketch below, as you're doing an additional conversion in and out (though the wrap() gets added by the glue code). The const & is notional (as you learned via your last question, a SEXP is a pointer object that is lightweight to copy) but better style.

You didn't show your benchmark results, so I can't comment on the effect of matrix size etc. pp. I suspect you might get better answers on rcpp-devel than here. Your pick.
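A minimal sketch of such an interface (this is essentially the adjusted A_matvecprod_elwise from the question: both arguments passed by const reference, with the attributes glue code doing the conversions):

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y) {
  // materialize the recycled vector once, then take the elementwise (Hadamard) product
  arma::mat Y = arma::repmat(y, 1, X.n_cols);
  return X % Y;
}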
Edit: If you really want something cheap and fast, I would just do this:
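(A minimal sketch of such a cheap, in-place approach; the function name cheap_hadamard is illustrative, not necessarily the original:)

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix cheap_hadamard(NumericMatrix X, NumericVector y) {
  // NumericMatrix and NumericVector are thin proxies for the R objects,
  // so this loop writes straight into X's memory -- note that the matrix
  // passed in from R is modified as a side effect.
  for (int j = 0; j < X.ncol(); j++)
    for (int i = 0; i < X.nrow(); i++)
      X(i, j) *= y[i];
  return X;
}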
which allocates no new memory and will hence be faster, and probably be competitive with R.
Test output:
My apologies for giving an essentially C answer to a C++ question, but as has been suggested the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply so you would have to implement your own.
Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated but for most applications using R that isn't a concern. Note that this also operates "in-place", which means that it modifies X without copying it.
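A sketch of this approach (the names hadamard_inplace_worker and C_matvecprod_elwise_inplace are illustrative, not necessarily those of the original answer):

#include <Rcpp.h>
using namespace Rcpp;

// plain C-style worker: multiply every column of x elementwise by y
void hadamard_inplace_worker(double* __restrict x, int numRows, int numCols,
                             const double* __restrict y)
{
  for (int col = 0; col < numCols; ++col) {
    double* x_col = x + col * numRows;   // R matrices are column-major
    for (int row = 0; row < numRows; ++row)
      x_col[row] *= y[row];
  }
}

// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix X, NumericVector y) {
  // operates on X's memory directly, i.e. the R matrix is modified in place
  hadamard_inplace_worker(X.begin(), X.nrow(), X.ncol(), y.begin());
  return X;
}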
Here is a version that makes a copy first (sketched below). I don't know Rcpp well enough to do this natively without incurring a substantial performance hit: creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.
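A sketch of the copy-making variant, under the same assumptions as above (the names are again illustrative):

#include <Rcpp.h>
using namespace Rcpp;

// copy variant: writes the product into a freshly allocated result matrix
void hadamard_copy_worker(const double* __restrict x, int numRows, int numCols,
                          const double* __restrict y, double* __restrict z)
{
  for (int col = 0; col < numCols; ++col) {
    const double* x_col = x + col * numRows;
    double*       z_col = z + col * numRows;
    for (int row = 0; row < numRows; ++row)
      z_col[row] = x_col[row] * y[row];
  }
}

// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_copy(NumericMatrix X, NumericVector y) {
  NumericMatrix out(X.nrow(), X.ncol());   // the extra allocation mentioned above
  hadamard_copy_worker(X.begin(), X.nrow(), X.ncol(), y.begin(), out.begin());
  return out;
}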
If you're curious about the usage of restrict: it means that you, as the programmer, enter a contract with the compiler that different bits of memory do not overlap, which allows the compiler to make certain optimizations. The restrict keyword is part of C99 but not of standard C++; many compilers, however, offer it as an extension (typically spelled __restrict or __restrict__).

Some R code to benchmark:
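(A sketch of such a benchmark, reusing the setup from the question and the illustrative function names from the sketches above; as in the question, benchmark() comes from the rbenchmark package:)

library(rbenchmark)

n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)

benchmark(R_matvecprod_elwise(X, e),
          C_matvecprod_elwise_copy(X, e),
          C_matvecprod_elwise_inplace(X, e),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative", replications = 1000)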
And the results:
Finally, the in-place version may actually be faster than the benchmark indicates, as the repeated multiplications into the same matrix across replications can cause some overflow mayhem.
Edit: Removed the loop unrolling, as it provided no benefit and was otherwise distracting.
I'd like to build on Sameer's answer, but I don't have enough reputation to comment. I personally got better performance (about 50%) in Eigen using:
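(A sketch of one Eigen expression with this property, following the E_matvecprod_elwise interface from the question; the suffix _diag is illustrative:)

#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using namespace Rcpp;
using namespace Eigen;

// [[Rcpp::export]]
NumericMatrix E_matvecprod_elwise_diag(NumericMatrix Xs, NumericVector ys) {
  Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
  Map<VectorXd> y(as<Map<VectorXd> >(ys));
  // y.asDiagonal() is only a view: the product scales row i of X by y(i)
  // without ever materializing an n x n diagonal matrix.
  MatrixXd out = y.asDiagonal() * X;
  return wrap(out);
}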
Despite the appearance, this does not create an n x n temporary for y.

If you want to speed up your calculations, you will have to be a little careful about not making copies. This usually means sacrificing readability. Here is a version which makes no copies and modifies the matrix X in place.
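(A sketch of such a no-copy, in-place version, assuming the RcppEigen Map approach from the question; the function name is illustrative:)

#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using namespace Rcpp;
using namespace Eigen;

// [[Rcpp::export]]
NumericMatrix E_matvecprod_elwise_inplace(NumericMatrix Xs, NumericVector ys) {
  // Map<> views the memory owned by R, so no data is copied.
  Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
  Map<VectorXd> y(as<Map<VectorXd> >(ys));
  for (int j = 0; j < X.cols(); ++j)
    X.col(j) = X.col(j).cwiseProduct(y);   // overwrite column j of the R matrix
  return Xs;                               // the same object, now modified
}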
Here is what I get on my machine