Is it possible to fit a linear regression for every single row of a data frame without using a loop? The output of the trend line (intercept + slope) should be added to the original data frame as new columns.
To make my intention clearer, I have prepared a very small example:
day1 <- c(1,3,1)
day2 <- c(2,2,1)
day3 <- c(3,1,5)
output.intercept <- c(0,4,-1.66667)
output.slope <- c(1,-1,2)
data <- data.frame(day1,day2,day3,output.intercept,output.slope)
Input variables are day1-3; let's say those are the sales of different shops on 3 consecutive days. What I want to do is calculate a linear trend line for each row and add the output parameters to the original table (see output.intercept + output.slope) as new columns.
The solution should be very efficient in terms of computation time, since the real data frame has several hundred thousand rows.
Best, Christoph
design.mat <- cbind(1,1:3)
response.mat <- t(data[,1:3])
reg <- lm.fit(design.mat, response.mat)$coefficients
data <- cbind(data, t(reg))
# day1 day2 day3 output.intercept output.slope x1 x2
#1 1 2 3 0.00000 1 0.000000 1
#2 3 2 1 4.00000 -1 4.000000 -1
#3 1 1 5 -1.66667 2 -1.666667 2
However, if you have massive data, it might be necessary to loop due to memory restrictions. In that case I would reshape to a long-format data.table and loop with the package's by syntax.
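As a rough sketch of that approach (the long-format column names shop, day and sales are illustrative, not from the original post):

```r
library(data.table)

# Melt the wide data into long format: one row per (shop, day) observation
dt <- data.table(shop  = rep(1:3, times = 3),
                 day   = rep(1:3, each  = 3),
                 sales = c(1, 3, 1,  2, 2, 1,  3, 1, 5))

# One regression per shop; coef() returns c(intercept, slope)
res <- dt[, as.list(coef(lm(sales ~ day))), by = shop]
setnames(res, c("shop", "intercept", "slope"))
```

This processes one group at a time, so peak memory stays modest even with very many shops, at the cost of calling lm() per group.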
Using your data,
day1 <- c(1,3,1)
day2 <- c(2,2,1)
day3 <- c(3,1,5)
output.intercept <- c(0,4,-1.66667)
output.slope <- c(1,-1,2)
dat <- data.frame(day1,day2,day3)
I think you want something like this:
fits <- lm.fit(cbind(1, seq_len(nrow(dat))), t(dat))
t(coef(fits))
Which gives
R> t(coef(fits))
x1 x2
[1,] 0.000 1
[2,] 4.000 -1
[3,] -1.667 2
These can be added to dat like so:
dat <- cbind(dat, t(coef(fits)))
names(dat)[-(1:3)] <- c("Intercept","Slope")
R> dat
day1 day2 day3 Intercept Slope
1 1 2 3 0.000 1
2 3 2 1 4.000 -1
3 1 1 5 -1.667 2
It would perhaps be easier to store the data the other way around, with columns as time series rather than rows, if you have any control over how the data are arranged initially, as it would avoid having to transpose a large matrix when fitting via lm.fit(). Ideally, you'd want the data arranged like this initially:
[,1] [,2] [,3]
day1 1 3 1
day2 2 2 1
day3 3 1 5
I.e. the rows as the time points rather than the individual series as you have them now. This is because of the way R expects data to be arranged. Note we have to transpose your dat in the lm.fit() call, which entails a copy of a large object. Hence if you can control how these data are arranged/supplied before they get into R, that would help for the large problem.
lm.fit() is used here because it is the underlying, lean code used by lm(), but it avoids the overhead of parsing the formula and creating model matrices. If you want more efficiency still, you might have to do the QR decomposition yourself (the code to do this is in lm.fit()), as lm.fit() performs a few sanity checks that you might be able to do away with if you are certain your data won't lead to singular matrices etc.
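A minimal sketch of that last idea, solving the least-squares problem directly via QR and skipping lm.fit()'s checks (this assumes the design matrix is full rank):

```r
X <- cbind(1, 1:3)          # design matrix: intercept + time
Y <- t(data[, 1:3])         # one series per column, as discussed above
coefs <- qr.coef(qr(X), Y)  # QR-based least squares, as lm.fit() does internally
data <- cbind(data, t(coefs))
```

The QR decomposition of X is computed once and reused for every series, which is where the saving comes from.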
I had the same problem as OP. This solution will work on data with NAs. All of the previous answers generate an error for me in this case:
slp <- function(x) {
  y <- as.numeric(x)
  y <- y[!is.na(y)]           # drop missing days (assumes remaining points stay equally spaced)
  len <- seq_along(y)         # time index 1..n
  cov(y, len) / var(len)      # least-squares slope of y on time
}
reg_slp <- apply(data[, 1:3], 1, slp)
This only computes the slope, but the intercept could easily be added. I doubt it is particularly efficient, but it was effective in my case.
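For instance, the intercept can be recovered from the same moments, since the least-squares line passes through the point of means (a sketch, assuming time is indexed 1..n after dropping NAs, as above):

```r
itc <- function(x) {
  y <- as.numeric(x)
  y <- y[!is.na(y)]
  len <- seq_along(y)
  b <- cov(y, len) / var(len)   # slope, as in slp()
  mean(y) - b * mean(len)       # intercept: line passes through (mean(len), mean(y))
}
reg_itc <- apply(data[, 1:3], 1, itc)
```

On the example data this reproduces output.intercept (0, 4, -1.667).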
Or like this?
day1 <- c(1,3,1)
day2 <- c(2,2,1)
day3 <- c(3,1,5)
data <- data.frame(day1,day2,day3)
y <- 1:3
reg <- apply(data, 1, function(x) lm(as.numeric(x) ~ y))
data[, c("intercept", "slope")] <- t(sapply(reg, coef))