Rcpp implementation of mvtnorm::pmvnorm slower tha

I am trying to get a Rcpp version of pmvnorm to work at least as fast as mvtnorm::pmvnorm in R.

I have found https://github.com/zhanxw/libMvtnorm and created a Rcpp skeleton package with the relevant source files. I have added the following functions which make use of Armadillo (since I'm using it across other code I've been writing).

//[[Rcpp::export]]
arma::vec triangl(const arma::mat& X){
  arma::mat LL = arma::trimatl(X, -1);  // omit the main diagonal
  return LL.elem(arma::find(LL != 0));
}

//[[Rcpp::export]]
double pmvnorm_cpp(arma::vec& bound, arma::vec& lowtrivec){
  double error;
  int n = bound.n_elem;
  double* boundptr = bound.memptr();
  double* lowtrivecptr = lowtrivec.memptr();
  double result = pmvnorm_P(n, boundptr, lowtrivecptr, &error);
  return result;
}

From R after building the package, this is a reproducible example:

set.seed(1)
covar <- rWishart(1, 10, diag(5))[,,1]
sds <- diag(covar) ^-.5
corrmat <- diag(sds) %*% covar %*% diag(sds)
triang <- triangl(corrmat)

bounds <- c(0.5, 0.9, 1, 4, -1)

rbenchmark::benchmark(pmvnorm_cpp(bounds, triang),
                      mvtnorm::pmvnorm(upper=bounds, corr = corrmat),
                      replications=1000)

Which shows that pmvnorm_cpp is much slower than mvtnorm::pmvnorm. and the result is different.

> pmvnorm_cpp(bounds, triang)
[1] 0.04300643
> mvtnorm::pmvnorm(upper=bounds, corr = corrmat)
[1] 0.04895361

which puzzles me because I thought the base fortran code was the same. Is there something in my code that makes everything go slow? Or should I try to port the mvtnorm::pmvnorm code directly? I have literally no experience with fortran.

Suggestions appreciated, excuse my incompetence othewise.

EDIT: to make a quick comparison with an alternative, this:

//[[Rcpp::export]]
NumericVector pmvnorm_cpp(NumericVector bound, NumericMatrix cormat){
  Environment stats("package:mvtnorm"); 
  Function f = stats["pmvnorm"];

  NumericVector lower(bound.length(), R_NegInf);
  NumericVector mean(bound.length());
  NumericVector res = f(lower, bound, mean, cormat);
  return res;
}

has essentially the same performance as an R call (the following on a 40-dimensional mvnormal):

> rbenchmark::benchmark(pmvnorm_cpp(bounds, corrmat),
+                       mvtnorm::pmvnorm(upper=bounds, corr = corrmat),
+                       replications=100)
                                              test replications elapsed relative user.self sys.self
2 mvtnorm::pmvnorm(upper = bounds, corr = corrmat)          100   16.86    1.032     16.60     0.00
1                     pmvnorm_cpp(bounds, corrmat)          100   16.34    1.000     16.26     0.01

so it seems to me there must be something going on in the previous code. either with how I'm handling things with Armadillo, or how the other things are connected. I would assume that there should be a performance gain compared to this last implementation.

Instead of trying to use an additional library for this, I would try to use the C API exported by mvtnorm, c.f. https://github.com/cran/mvtnorm/blob/master/inst/NEWS#L44-L48. While doing so, I found three reasons why the results differ. One of them is also responsible for the preformance difference:

mvtnorm uses R's RNG, while this has been removed from the library you are using, c.f. https://github.com/zhanxw/libMvtnorm/blob/master/libMvtnorm/randomF77.c.
Your triangl function is incorrect. It returns the lower triangular matrix in column-major order. However, the underlying fortran code expects it in row-major order, c.f. https://github.com/cran/mvtnorm/blob/master/src/mvt.f#L36-L39 and https://github.com/zhanxw/libMvtnorm/blob/master/libMvtnorm/mvtnorm.cpp#L60
libMvtnorm uses 1e-6 instead of 1e-3 as relative precision, c.f. https://github.com/zhanxw/libMvtnorm/blob/master/libMvtnorm/mvtnorm.cpp#L65. This is also responsible for the performance difference.

We can test this using the following code:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::depends(mvtnorm)]]
#include <mvtnormAPI.h>

//[[Rcpp::export]]
arma::vec triangl(const arma::mat& X){
  int n = X.n_cols;
  arma::vec res(n * (n-1) / 2);
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < i; ++j) {
      res(j + i * (i-1) / 2) = X(i, j);
    }
  }
  return res;
}

// [[Rcpp::export]]
double pmvnorm_cpp(arma::vec& bound,
           arma::vec& lowertrivec,
           double abseps = 1e-3){

  int n = bound.n_elem;
  int nu = 0;
  int maxpts = 25000;     // default in mvtnorm: 25000
  double releps = 0;      // default in mvtnorm: 0
  int rnd = 1;            // Get/PutRNGstate

  double* bound_ = bound.memptr();
  double* correlationMatrix = lowertrivec.memptr();
  double* lower = new double[n];
  int* infin = new int[n];
  double* delta = new double[n];

  for (int i = 0; i < n; ++i) {
    infin[i] = 0; // (-inf, bound]
    lower[i] = 0.0;
    delta[i] = 0.0;
  }

  // return values
  double error;
  double value;
  int inform;

  mvtnorm_C_mvtdst(&n, &nu, lower, bound_,
           infin, correlationMatrix, delta,
           &maxpts, &abseps, &releps,
           &error, &value, &inform, &rnd);
  delete[] (lower);
  delete[] (infin);
  delete[] (delta);

  return value;
}

/*** R
set.seed(1)
covar <- rWishart(1, 10, diag(5))[,,1]
sds <- diag(covar) ^-.5
corrmat <- diag(sds) %*% covar %*% diag(sds)
triang <- triangl(corrmat)
bounds <- c(0.5, 0.9, 1, 4, -1)
set.seed(1)
system.time(cat(mvtnorm::pmvnorm(upper=bounds, corr = corrmat), "\n"))
set.seed(1)
system.time(cat(pmvnorm_cpp(bounds, triang, 1e-6), "\n"))
set.seed(1)
system.time(cat(pmvnorm_cpp(bounds, triang, 0.001), "\n"))
 */

Results:

> system.time(cat(mvtnorm::pmvnorm(upper=bounds, corr = corrmat), "\n"))
0.04896221 
   user  system elapsed 
  0.000   0.003   0.003 

> system.time(cat(pmvnorm_cpp(bounds, triang, 1e-6), "\n"))
0.04895756 
   user  system elapsed 
  0.035   0.000   0.035 

> system.time(cat(pmvnorm_cpp(bounds, triang, 0.001), "\n"))
0.04896221 
   user  system elapsed 
  0.004   0.000   0.004

With the same RNG (and RNG state), the correct lower triangular correlation matrix and the same relative precision, results are identical and performance is comparable. With higher precision, performance suffers.

All this is for a stand-alone file using Rcpp::sourceCpp. In order to use this in a package, you need to add LinkingTo: mvtnorm to your DESCRIPTION file.