I want to compute ordinary least square (OLS) estimates in R without using "lm", and this for several reasons. First, "lm" also computes lots of stuff I don't need (such as the fitted values) considering that data size is an issue in my case. Second, I want to be able to implement OLS myself in R before doing it in another language (eg. in C with the GSL).
As you may know, the model is: Y=Xb+E; with E ~ N(0, sigma^2). As detailed below, b is a vector with 2 parameters, the mean (b0) and another coefficients (b1). At the end, for each linear regression I will do, I want the estimate for b1 (effect size), its standard error, the estimate for sigma^2 (residual variance), and R^2 (determination coef).
Here are my data. I have N samples (eg. individuals, N~=100). For each sample, I have Y outputs (response variables, Y~=10^3) and X points (explanatory variables, X~=10^6). I want to treat the Y outputs separately, ie. I want to launch Y linear regressions: one for output 1, one for output 2, etc. Moreover, I want to use explanatory variables one y one: for output 1, I want to regress it on point 1, then on point 2, then ... finally on point X. (I hope it's clear...!)
Here is my R code to check the speed of "lm" versus computing OLS estimates myself via matrix algebra.
First, I simulate dummy data:
nb.samples <- 10 # N
nb.points <- 1000 # X
x <- matrix(data=replicate(nb.samples,sample(x=0:2,size=nb.points, replace=T)),
nrow=nb.points, ncol=nb.samples, byrow=F,
dimnames=list(points=paste("p",seq(1,nb.points),sep=""),
samples=paste("s",seq(1,nb.samples),sep="")))
nb.outputs <- 10 # Y
y <- matrix(data=replicate(nb.outputs,rnorm(nb.samples)),
nrow=nb.samples, ncol=nb.outputs, byrow=T,
dimnames=list(samples=paste("s",seq(1,nb.samples),sep=""),
outputs=paste("out",seq(1,nb.outputs),sep="")))
Here is my own function used just below:
GetResFromCustomLinReg <- function(Y, xi){ # both Y and xi are N-dim vectors
n <- length(Y)
X <- cbind(rep(1,n), xi) #
p <- 1 # nb of explanatory variables, besides the mean
r <- p + 1 # rank of X: nb of indepdt explanatory variables
inv.XtX <- solve(t(X) %*% X)
beta.hat <- inv.XtX %*% t(X) %*% Y
Y.hat <- X %*% beta.hat
E.hat <- Y - Y.hat
E2.hat <- (t(E.hat) %*% E.hat)
sigma2.hat <- (E2.hat / (n - r))[1,1]
var.covar.beta.hat <- sigma2.hat * inv.XtX
se.beta.hat <- t(t(sqrt(diag(var.covar.beta.hat))))
Y.bar <- mean(Y)
R2 <- 1 - (E2.hat) / (t(Y-Y.bar) %*% (Y-Y.bar))
return(c(beta.hat[2], se.beta.hat[2], sigma2.hat, R2))
}
Here is my code using the built-in "lm":
res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)})
Here is my custom OLS code:
res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)})
When I run this example with the values given above, I get:
> system.time( res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)}) )
user system elapsed
2.526 0.000 2.528
> system.time( res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)}) )
user system elapsed
4.561 0.000 4.561
(And, naturally, it gets worse when increasing N, X and Y.)
Of course, "lm" has the nice property of "automatically" fitting separately each column of the response matrix (y~xi), while I have to use "apply" Y times (for each yi~xi). But is this the only reason why my code is slower? Does one of you know how to improve this?
(Sorry for such a long question, but I really tried to provide a minimal, yet comprehensive example.)
> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-redhat-linux-gnu (64-bit)