Calculating a linear trend line for every row of a

2019-04-13 14:52发布

is it somehow possible to conduct a linear regression for every single row of a data frame without using a loop? The output (intercept + slope) of the trend line should be added to the original data frame as new columns.

To make my intention more clearly, I have prepared a very small data example:

day1 <- c(1,3,1)
day2 <- c(2,2,1)
day3 <- c(3,1,5)
output.intercept <- c(0,4,-1.66667)
output.slope <- c(1,-1,2)
data <- data.frame(day1,day2,day3,output.intercept,output.slope)

Input variables are day1-3; let's say those are the sales for different shops on 3 consecutive days. What I want to do is to calculate a linear trend line for the 3 rows and add the output parameters to the origin table (see output.intercept + output.slope) as new columns.

The solution should be very efficient in terms of calculation time since the real data frame has many 100k's of rows.

Best, Christoph

4条回答
戒情不戒烟
2楼-- · 2019-04-13 14:57

Or like this?

day1 <- c(1,3,1)
day2 <- c(2,2,1)
day3 <- c(3,1,5)
data <- data.frame(day1,day2,day3)
y<-1:3

reg<-apply(data,1,function(x) lm(as.numeric(x)~y))
data[,c("intercept","slope")]<-rbind(reg[[1]]$coef,reg[[2]]$coef,reg[[3]]$coef)
查看更多
Deceive 欺骗
3楼-- · 2019-04-13 15:02

I had the same problem as OP. This solution will work on data with NAs. All of the previous answers generate an error for me in this case:

slp = function(x) {
  y = t(x)
  y = y[!is.na(y)]
  len = length(y):1
  b = cov(y,len)/var(len)
  return(b)}

reg_slp <- apply(data,1,slp)

Only gets the slope, but the intercept could be easily added. I doubt this is particularly efficient, but it was effective in my case.

查看更多
【Aperson】
4楼-- · 2019-04-13 15:10

Using your data,

day1 <- c(1,3,1)
day2 <- c(2,2,1)
day3 <- c(3,1,5)
output.intercept <- c(0,4,-1.66667)
output.slope <- c(1,-1,2)
dat <- data.frame(day1,day2,day3)

I think you want something like this:

fits <- lm.fit(cbind(1, seq_len(nrow(dat))), t(dat))
t(coef(fits))

Which gives

R> t(coef(fits))
         x1 x2
[1,]  0.000  1
[2,]  4.000 -1
[3,] -1.667  2

These can be added to dat like so

dat <- cbind(dat, t(coef(fits)))
names(dat)[-(1:3)] <- c("Intercept","Slope")

R> dat
  day1 day2 day3 Intercept Slope
1    1    2    3     0.000     1
2    3    2    1     4.000    -1
3    1    1    5    -1.667     2

It would perhaps be easier to store the data the other way, with columns as time series, rather than rows, if you have any control over the way the data are arranged initially as it would avoid having to transpose a large matrix when fitting via lm.fit(). Ideally, you;d want the data arranged like this initially:

     [,1] [,2] [,3]
day1    1    3    1
day2    2    2    1
day3    3    1    5

I.e. the rows as the time points rather than individual series as you have them now. This is because of the way R expects data to be arranged. Note we have to transpose your dat in the lm.fit() call which will entail a copy of a large object. Hence if you can control how these data are arranged/supplied before they get into R, that would help for the large problem.

lm.fit() is used as that is the underlying, lean code used by lm() but we avoid the complexities of parsing the formula and creating model matrices. If you want more efficient, you might have to look to doing the QR decomposition yourself (the code is in lm.fit() to do this) as there are a few things lm.fit() does as sanity checks that you might be able to do away with if you are certain your data won't lead to singular matrices etc.

查看更多
Root(大扎)
5楼-- · 2019-04-13 15:19
design.mat <- cbind(1,1:3)
response.mat <- t(data[,1:3])

reg <- lm.fit(design.mat, response.mat)$coefficients
data <- cbind(data, t(reg))
#  day1 day2 day3 output.intercept output.slope        x1 x2
#1    1    2    3          0.00000            1  0.000000  1
#2    3    2    1          4.00000           -1  4.000000 -1
#3    1    1    5         -1.66667            2 -1.666667  2

However, if you have massive data, it might be necessary to loop due to memory restrictions. If that's the case I would use a long format data.table and use the package's by syntax to loop.

查看更多
登录 后发表回答