I have an issue finding the most efficient way to calculate a rolling linear regression over a xts object with multiple columns. I have searched and read several previously questions here on stackoverflow.
This question and answer comes close but not enough in my opinion as I want to calculate multiple regressions with the dependent variable unchanged in all the regressions. I have tried to reproduce an example with random data:
require(xts)
require(RcppArmadillo) # Load libraries
data <- matrix(sample(1:10000, 1500), 1500, 5, byrow = TRUE) # Random data
data[1000:1500, 2] <- NA # insert NAs to make it more similar to true data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))
NR <- nrow(data) # number of observations
NC <- ncol(data) # number of factors
obs <- 30 # required number of observations for rolling regression analysis
info.names <- c("res", "coef")
info <- array(NA, dim = c(NR, length(info.names), NC))
colnames(info) <- info.names
The array is created in order to store multiple variables (residuals, coefficients etc.) over time and per factor.
loop.begin.time <- Sys.time()
for (j in 2:NC) {
cat(paste("Processing residuals for factor:", j), "\n")
for (i in obs:NR) {
regression.temp <- fastLm(data[i:(i-(obs-1)), j] ~ data[i:(i-(obs-1)), 1])
residuals.temp <- regression.temp$residuals
info[i, "res", j] <- round(residuals.temp[1] / sd(residuals.temp), 4)
info[i, "coef", j] <- regression.temp$coefficients[2]
}
}
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time) # prints the loop runtime
As the loop shows the idea is to run a 30 observations rolling regression with data[, 1]
as the dependent variable (factor) every time against one of the other factors. I have to store the 30 residuals in a temporary object in order to standardize them as fastLm
does not calculate standardized residuals.
The loop is extremely slow and becomes a cumbersome if the numbers of columns (factors) in the xts object increases to around 100 - 1,000 columns would take an eternity. I hope one has a more efficient code to create rolling regressions over a large data set.
Here is a much faster way to do it with the
rollRegres
packageDo notice this comment inside the above solution
Here is a comparison to Sameer's answer
The time includes the time used to compute the standardized residuals.
It should be pretty quick if you go down to level of the math of the linear regression. If X is the independent variable and Y is the dependent variable. The coefficients are given by
Beta = inv(t(X) %*% X) %*% (t(X) %*% Y)
I'm a little confused about which variable you want to be the dependent and which one is the independent but hopefully solving a similar problem below will help you as well.
In the example below I take 1000 variables instead of the original 5 and do not introduce any NA's.
Now we can calculate the coefficients using Joshua's TTR package.
Time difference of 3.934461 secs
If I haven't made any errors in the indexing
res.sd
should give you the standardized residuals. Please feel free to fix this solution to correct any bugs.