I have a list for formulas I want to fit to data, rather than running a loop I'd like to do this at once, for performance's sake. The estimations should still be separate, I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <-list(y~x[,1],
y~x[,2],
y~x[,1] + x[,2])
lapply(formulae,lm)
Unfortunately this gets somewhat slow as the length of formulae
increases is there a way to truly vectorize this?
If it is any help, the only results of lm
I care about are coefficients, and some standard errors.
There's not really an easy way to vectorize this, but the pdredge
function from the MuMIn
package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel
package ...
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~.
to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge
for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA
As I said in my comment, what you really need is a more efficient yet stable fitting routine other than lm()
. Here I would provide you a well tested one written myself, called lm.chol()
. It takes a formula
and data
, and returns:
- a coefficient summary table, as you normally see in
summary(lm(...))$coef
;
- Pearson estimate of residual standard error, as you get from
summary(lm(...))$sigma
;
- adjusted-R.squared, as you get from
summary(lm(...))$adj.r.squared
.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
## stage0: get response vector and model matrix
## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
y <- data[[as.character(formula[[2]])]]
X <- model.matrix(formula, data)
n <- nrow(X); p <- ncol(X)
## stage 1: XtX and Jacobi diagonal preconditioner
XtX <- crossprod(X)
D <- 1 / sqrt(diag(XtX))
## stage 2: pivoted Cholesky factorization
R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
piv <- attr(R, "pivot")
r <- attr(R, "rank")
if (r < p) {
warning("Model is rank-deficient!")
piv <- piv[1:r]
R <- R[1:r, 1:r]
}
## stage 3: solve linear system for coefficients
D <- D[piv]
b <- D * crossprod(X, y)[piv]
z <- forwardsolve(t(R), b)
RSS <- sum(y * y) - sum(z * z)
sigma <- sqrt(RSS / (n - r))
para <- D * backsolve(R, z)
beta.hat <- rep(NA, p)
beta.hat[piv] <- para
## stage 4: get standard error
Rinv <- backsolve(R, diag(r))
se <- rep(NA, p)
se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
## stage 5: t-statistic and p-value
t.statistic <- beta.hat / se
p.value <- 2 * pt(-abs(t.statistic), df = n - r)
## stage 6: construct coefficient summary matrix
coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- colnames(X)
## stage 7: compute adjusted R.squared
adj.R2 <- 1 - sigma * sigma / var(y)
## return model fitting results
attr(coefficients, "sigma") <- sigma
attr(coefficients, "adj.R2") <- adj.R2
coefficients
}
Here I would offer three examples.
Example 1: full rank linear model
We take R's built-in dataset trees
as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol()
based on Cholesky factorization with complete pivoting and lm()
based on QR factorization with partial pivoting have shrunk different coefficients to NA
. But two estimation are equivalent, with the same fitted values and residuals.
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol()
is 3 ~ 4 times faster than lm()
. If you want to know the reason, read my this answer.
Remark
I have focused on improving performance on computational kernel. You can take one step further, by using Ben Bolker's parallelism suggestion. If my approach gives 3 times boost, and parallel computing gives 3 times boost on 4 cores, you end up with 9 times boost!