Is it somehow possible to conduct a linear regression for every single row of a data frame without using a loop? The output of the trend line (intercept + slope) should be added to the original data frame as new columns.
To make my intention clearer, I have prepared a very small data example:
day1 <- c(1,3,1)
day2 <- c(2,2,1)
day3 <- c(3,1,5)
output.intercept <- c(0,4,-1.66667)
output.slope <- c(1,-1,2)
data <- data.frame(day1,day2,day3,output.intercept,output.slope)
Input variables are day1-3; let's say those are the sales of different shops on 3 consecutive days. What I want to do is calculate a linear trend line for each of the 3 rows and add the output parameters (see output.intercept + output.slope) to the original table as new columns.
The solution should be very efficient in terms of computation time, since the real data frame has many hundreds of thousands of rows.
Best, Christoph
Or like this?
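For instance, a fully vectorised sketch using the closed-form least-squares formulas (the names Y, xc, intercept2 and slope2 are illustrative, and it assumes the days are the time points 1:3):

Y  <- as.matrix(data[, c("day1", "day2", "day3")])
x  <- 1:3
xc <- x - mean(x)                     # centred time points
slope <- drop(Y %*% xc) / sum(xc^2)   # Sxy / Sxx for every row at once
intercept <- rowMeans(Y) - slope * mean(x)
data$intercept2 <- intercept
data$slope2 <- slope

No per-row model objects are created, so this stays fast even for hundreds of thousands of rows.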
I had the same problem as OP. This solution will work on data with NAs. All of the previous answers generate an error for me in this case:
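A sketch along those lines, assuming the day positions are x <- 1:3 (the column name slope.na is illustrative):

x <- 1:3
data$slope.na <- apply(data[, c("day1", "day2", "day3")], 1, function(y) {
  ok <- !is.na(y)                     # keep only the observed days
  if (sum(ok) < 2) return(NA_real_)   # need at least two points for a line
  coef(lm(y[ok] ~ x[ok]))[[2]]        # [[2]] is the slope, [[1]] the intercept
})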
This only gets the slope, but the intercept could easily be added. I doubt this is particularly efficient, but it was effective in my case.
Using your data, I think you want something like this:
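A minimal sketch of that approach, assuming the time points are days <- 1:3 and that dat holds only the three day columns:

dat <- data.frame(day1, day2, day3)
days <- 1:3
## one QR decomposition fits all rows at once; t(dat) turns each shop
## into a response column
fits <- lm.fit(cbind(Intercept = 1, Slope = days), t(dat))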
Which gives
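something along these lines (the values match output.intercept and output.slope above; the exact print format may differ):

coef(fits)
##           1  2         3
## Intercept 0  4 -1.666667
## Slope     1 -1  2.000000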
These can be added to dat like so:
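For example (the new column names are taken from the design matrix above):

dat <- cbind(dat, t(coef(fits)))   # adds Intercept and Slope columns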
It would perhaps be easier to store the data the other way around, with columns as time series rather than rows, if you have any control over the way the data are arranged initially, as it would avoid having to transpose a large matrix when fitting via lm.fit(). Ideally, you'd want the data arranged like this initially:
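A sketch of that layout (the shop names are illustrative):

dat2 <- as.data.frame(t(dat[, 1:3]))   # series as columns, days as rows
names(dat2) <- paste0("shop", 1:3)
dat2
##      shop1 shop2 shop3
## day1     1     3     1
## day2     2     2     1
## day3     3     1     5
fits2 <- lm.fit(cbind(Intercept = 1, Slope = days), as.matrix(dat2))   # no transpose needed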
I.e. the rows as the time points rather than the individual series as you have them now. This is because of the way R expects data to be arranged. Note that we have to transpose your dat in the lm.fit() call, which will entail a copy of a large object; hence, if you can control how these data are arranged/supplied before they get into R, that would help for the large problem.
lm.fit() is used as it is the underlying, lean code used by lm(), but we avoid the complexities of parsing the formula and creating model matrices. If you want something more efficient, you might have to look at doing the QR decomposition yourself (the code to do this is in lm.fit()), as lm.fit() does a few sanity checks that you might be able to do away with if you are certain your data won't lead to singular matrices etc.
However, if you have massive data, it might be necessary to loop due to memory restrictions. If that's the case, I would use a long-format data.table and use the package's by syntax to loop.
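A minimal sketch of that idea, assuming the data have been reshaped to long format with columns shop, day and sales (all illustrative names):

library(data.table)
DT <- data.table(shop  = rep(seq_len(nrow(dat)), each = 3),
                 day   = rep(1:3, times = nrow(dat)),
                 sales = as.vector(t(as.matrix(dat[, 1:3]))))
## by = shop loops group-wise inside data.table, keeping memory use low
coefs <- DT[, as.list(coef(lm(sales ~ day))), by = shop]
setnames(coefs, c("shop", "output.intercept", "output.slope"))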