For a neuroimaging application, I'm trying to fit many linear models by least squares in R (standard call to lm). Imagine I have a design matrix X. This design matrix will be the same across all of the models. The data (Y) being fit will change, and as a result so will all of the fit parameters (e.g. betas, p-values, residuals, etc.).

At present, I'm just sticking it in a for loop, so it's making hundreds of thousands of calls to lm. It seems like there's got to be a better way.

I believe the most computationally expensive piece is the matrix inversion. It looks like this gets handled with a Fortran call in lm.fit.

If I were doing this regression by hand, I'd do the matrix inversion once, then just multiply it by the various datasets. In fact, I've coded up a function to do that when I have well-behaved design matrices (e.g. all continuously valued covariates). However, I really like all of the work that lm does, like recoding my factors appropriately, and the output of lm is really nice, too.

Is there any way to have my cake and eat it too? Namely, to get the friendliness of lm, but use that power to efficiently fit many models with identical design matrices?
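For concreteness, here is a rough sketch of the "by hand" approach mentioned above (assuming a purely numeric design matrix X and the responses collected column-wise in a matrix Y; the function name is just for illustration):

```r
## Compute (X'X)^{-1} X' once and reuse it for every dataset.
## X: n x p numeric design matrix; Y: n x m matrix, one response per column.
fit_many_by_hand <- function(X, Y) {
  proj  <- solve(crossprod(X), t(X))   # (X'X)^{-1} X', computed once
  coefs <- proj %*% Y                  # p x m matrix of betas
  resid <- Y - X %*% coefs             # residuals, one column per dataset
  list(coefficients = coefs, residuals = resid)
}
```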
Yes, there is a better way. We have been writing example replacement functions fastLm() based on external C / C++ code from Armadillo, GSL, and Eigen in the packages RcppArmadillo, RcppGSL, and RcppEigen.

By far the largest amount of time is spent setting up the model matrix and deparsing the formula. You can read the source of lm(), or maybe ours in fastLm(), and see how to do this parsing just once. Keep the right-hand side, and then loop over your different y vectors. Which fitting function you use matters less. I like fastLm() from RcppArmadillo, but hey, I also wrote it :)
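A rough sketch of the "parse once, fit many" idea, here using model.matrix() to build the design a single time and RcppArmadillo's bare-bones fastLmPure() for the repeated fits (the data frame dat, its covariates, and the list of responses Ylist are made up for illustration):

```r
library(RcppArmadillo)

## Build the model matrix once, handling factors, contrasts, etc.
## 'dat' is an assumed data frame of covariates.
X <- model.matrix(~ age + sex + group, data = dat)

## Loop over the response vectors with a bare-bones fitter
## that skips all formula/model-frame overhead.
fits <- lapply(Ylist, function(y) fastLmPure(X, y))

## Collect the coefficients, one column per response.
betas <- sapply(fits, `[[`, "coefficients")
```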
From the help page for lm: if the response is a matrix, a linear model is fitted separately by least squares to each column of the matrix.

So it would seem that a simple approach would be to combine all the different y vectors into a matrix and pass that as the response in a single call to lm.
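For example, a minimal sketch with made-up responses y1, y2, y3 and a single covariate x:

```r
## Made-up illustrative data: one shared covariate, three responses.
set.seed(1)
x  <- rnorm(100)
y1 <-  2.0 * x + rnorm(100)
y2 <- -1.0 * x + rnorm(100)
y3 <-  0.5 * x + rnorm(100)

## Bind the responses into a matrix; lm() then fits each column
## separately against the same right-hand side.
fit <- lm(cbind(y1, y2, y3) ~ x)
coef(fit)      # one column of estimates per response
summary(fit)   # a separate summary for each response
```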
I do not know of a better way using lm; but you may want to consider using the function lsfit. Although simpler and with fewer bells and whistles, the syntax lsfit(X, y) allows y to be not just a vector of response values but also a matrix. A single call to lsfit then fits all columns of y by regressing them on the same design matrix X. Quite fast and handy.
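A small sketch of that usage (the matrices X and Y below are made-up illustrative data; note that lsfit adds an intercept by default):

```r
## Made-up data: a shared design matrix and many response columns.
set.seed(1)
X <- matrix(rnorm(100 * 3), nrow = 100)       # 100 observations, 3 covariates
Y <- matrix(rnorm(100 * 5000), nrow = 100)    # 5000 response vectors

fit <- lsfit(X, Y)            # single call; intercept added by default
coefs <- fit$coefficients     # one column of coefficients per response
resid <- fit$residuals        # residual matrix, same shape as Y
```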