I always thought that the lm
function was extremely fast in R, but as this example would suggest, the closed solution computed using the solve
function is way faster.
data<-data.frame(y=rnorm(1000),x1=rnorm(1000),x2=rnorm(1000))
X = cbind(1,data$x1,data$x2)
library(microbenchmark)
microbenchmark(
solve(t(X) %*% X, t(X) %*% data$y),
lm(y ~ .,data=data))
Can someone explain me if this toy example is a bad example or it is the case that lm
is actually slow?
EDIT: As suggested by Dirk Eddelbuettel, as lm
needs to resolve the formula, the comparison is unfair, so better to use lm.fit
which doesn't need to resolve the formula
microbenchmark(
solve(t(X) %*% X, t(X) %*% data$y),
lm.fit(X,data$y))
Unit: microseconds
expr min lq mean median uq max neval cld
solve(t(X) %*% X, t(X) %*% data$y) 99.083 108.754 125.1398 118.0305 131.2545 236.060 100 a
lm.fit(X, y) 125.136 136.978 151.4656 143.4915 156.7155 262.114 100 b
You are overlooking that
solve()
only returns your parameters
lm()
returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...
- the main cost of your
lm()
call is not the projection but the resolution of the formula y ~ .
from which the model matrix needs to be built.
To illustrate Rcpp we wrote a few variants of a function fastLm()
doing more of what lm()
does (ie a bit more than lm.fit()
from base R) and measured it. See e.g. this benchmark script which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.
In short, you are doing the Right Thing by using benchmarking but you are doing it not all that correctly in trying to compare what is mostly incomparable: a subset with a much larger task.
Something wrong with your benchmarking
It is really surprising that no one observed this!
You have used t(X) %*% X
inside solve()
. You should use crossprod(X)
, as X'X
is a symmetric matrix. crossprod()
only computes half of the matrix while copying the rest. %*%
forces computing all. So crossprod()
will be two times faster. This explains why in your benchmarking, you have roughly the same timing between solve()
and lm.fit()
.
On my old Intel Nahalem (2008 Intel Core 2 Duo), I have:
X <- matrix(runif(1000*1000),1000)
system.time(t(X)%*%X)
# user system elapsed
# 2.320 0.000 2.079
system.time(crossprod(X))
# user system elapsed
# 1.22 0.00 0.94
If your computer is faster, try using X <- matrix(runif(2000*2000),2000)
instead.
In the following, I will explain the computational details involved in all fitting methods.
QR factorization v.s. Cholesky factorization
lm()
/ lm.fit()
is QR based, while solve()
is Cholesky based. The computational costs of QR decomposition is 2 * n * p^2
, while Cholesky method is n * p^2 + p^3
(n * p^2
for computing matrix cross product, p^3
for Cholesky decomposition). So you can see that when n
is much much greater than p
, Cholesky method is about 2 times faster than QR method. So there is really no need to benchmark here. (in case you don't know, n
is the number of data, p
is the number of parameters.)
LINPACK QR v.s. LAPACK QR
Typically, lm.fit()
uses (modified) LINPACK
QR factorization algorithm, rather than LAPACK
QR factorization algorithm. Maybe you are not very familiar with BLAS/LINPACK/LAPACK
; these are FORTRAN code providing kernel scientific matrix computations. LINPACK
calls level-1 BLAS, while LAPACK
calls level-3 BLAS
using block algorithms. On average, LAPACK
QR is 1.6 times faster than LINPACK
QR. The critical reason that lm.fit()
does not use LAPACK
version, is the need for partial column pivoting. LAPACK
version does full column pivoting, making it more difficult for summary.lm()
to use the R
matrix factor of QR decomposition to produce F-statistic and ANOVA
test.
pivoting v.s. no pivoting
fastLm()
from RcppEigen
package uses LAPACK
non-pivoted QR factorization. Again, you may be unclear about the QR factorization algorithm and pivoting issues. You only need to know that QR factorization with pivoting has only 50% share of level-3 BLAS
, while QR factorization without pivoting has 100% share of level-3 BLAS
. In this regard, giving up pivoting will speed up QR factorization process. Surely, the final result is different, and when the model matrix is rank deficient, no pivoting gives dangerous result.
There is a good question related to the different result you get from fastLM
: Why does fastLm()
return results when I run a regression with one observation?. @BenBolker, @DirkEddelbuettel and I had a very brief discussion in comments of Ben's answer.
Conclusion: Do you want speed, or numerical stability?
In terms of numerical stability, there is:
LINPACK pivoted QR > LAPACK pivoted QR > pivoted Cholesky > LAPACK non-pivoted QR
In terms of speed, there is:
LINPACK pivoted QR < LAPACK pivoted QR < pivoted Cholesky < LAPACK non-pivoted QR
As Dirk said,
FWIW the RcppEigen package has a fuller set of decompositions in its fastLm()
example. But here it is as Ben so eloquently stated: "this is part of the price you pay for speed.". We give you enough rope to hang yourself. If you want to protect yourself from yourself, stick with lm()
or lm.fit()
, or create a hybrid 'fast-but-safe' version.
Fast and stable version
Check my answer Here.