Why does lm run out of memory while matrix multipl

2019-01-24 07:14发布

问题:

I am trying to do fixed effects linear regression with R. My data looks like

dte   yr   id   v1   v2
  .    .    .    .    .
  .    .    .    .    .
  .    .    .    .    .

I then decided to simply do this by making yr a factor and use lm:

lm(v1 ~ factor(yr) + v2 - 1, data = df)

However, this seems to run out of memory. I have 20 levels in my factor and df is 14 million rows which takes about 2GB to store, I am running this on a machine with 22 GB dedicated to this process.

I then decided to try things the old fashioned way: create dummy variables for each of my years t1 to t20 by doing:

df$t1 <- 1*(df$yr==1)
df$t2 <- 1*(df$yr==2)
df$t3 <- 1*(df$yr==3)
...

and simply compute:

solve(crossprod(x), crossprod(x,y))

This runs without a problem and produces the answer almost right away.

I am specifically curious what is it about lm that makes it run out of memory when I can compute the coefficients just fine? Thanks.

回答1:

None of the answers so far has pointed to the right direction.

The accepted answer by @idr is making confusion between lm and summary.lm. lm computes no diagnostic statistics at all; instead, summary.lm does. So he is talking about summary.lm.

@Jake's answer is a fact on the numeric stability of QR factorization and LU / Choleksy factorization. Aravindakshan's answer expands this, by pointing out the amount of floating point operations behind both operations (though as he said, he did not count in the costs for computing matrix cross product). But, do not confuse FLOP counts with memory costs. Actually both method have the same memory usage in LINPACK / LAPACK. Specifically, his argument that QR method costs more RAM to store Q factor is a bogus one. The compacted storage as explained in lm(): What is qraux returned by QR decomposition in LINPACK / LAPACK clarifies how QR factorization is computed and stored. Speed issue of QR v.s. Chol is detailed in my answer: Why the built-in lm function is so slow in R?, and my answer on faster lm provides a small routine lm.chol using Choleksy method, which is 3 times faster than QR method.

@Greg's answer / suggestion for biglm is good, but it does not answer the question. Since biglm is mentioned, I would point out that QR decomposition differs in lm and biglm. biglm computes householder reflection so that the resulting R factor has positive diagonals. See Cholesky factor via QR factorization for details. The reason that biglm does this, is that the resulting R will be as same as the Cholesky factor, see QR decomposition and Choleski decomposition in R for information. Also, apart from biglm, you can use mgcv. Read my answer: biglm predict unable to allocate a vector of size xx.x MB for more.


After a summary, it is time to post my answer.

In order to fit a linear model, lm will

  1. generates a model frame;
  2. generates a model matrix;
  3. call lm.fit for QR factorization;
  4. returns the result of QR factorization as well as the model frame in lmObject.

You said your input data frame with 5 columns costs 2 GB to store. With 20 factor levels the resulting model matrix has about 25 columns taking 10 GB storage. Now let's see how memory usage grows when we call lm.

  • [global environment] initially you have 2 GB storage for the data frame;
  • [lm envrionment] then it is copied to a model frame, costing 2 GB;
  • [lm environment] then a model matrix is generated, costing 10 GB;
  • [lm.fit environment] a copy of model matrix is made then overwritten by QR factorization, costing 10 GB;
  • [lm environment] the result of lm.fit is returned, costing 10 GB;
  • [global environment] the result of lm.fit is further returned by lm, costing another 10 GB;
  • [global environment] the model frame is returned by lm, costing 2 GB.

So, a total of 46 GB RAM is required, far greater than your available 22 GB RAM.

Actually if lm.fit can be "inlined" into lm, we could save 20 GB costs. But there is no way to inline an R function in another R function.

Maybe we can take a small example to see what happens around lm.fit:

X <- matrix(rnorm(30), 10, 3)    # a `10 * 3` model matrix
y <- rnorm(10)    ## response vector

tracemem(X)
# [1] "<0xa5e5ed0>"

qrfit <- lm.fit(X, y)
# tracemem[0xa5e5ed0 -> 0xa1fba88]: lm.fit 

So indeed, X is copied when passed into lm.fit. Let's have a look at what qrfit has

str(qrfit)
#List of 8
# $ coefficients : Named num [1:3] 0.164 0.716 -0.912
#  ..- attr(*, "names")= chr [1:3] "x1" "x2" "x3"
# $ residuals    : num [1:10] 0.4 -0.251 0.8 -0.966 -0.186 ...
# $ effects      : Named num [1:10] -1.172 0.169 1.421 -1.307 -0.432 ...
#  ..- attr(*, "names")= chr [1:10] "x1" "x2" "x3" "" ...
# $ rank         : int 3
# $ fitted.values: num [1:10] -0.466 -0.449 -0.262 -1.236 0.578 ...
# $ assign       : NULL
# $ qr           :List of 5
#  ..$ qr   : num [1:10, 1:3] -1.838 -0.23 0.204 -0.199 0.647 ...
#  ..$ qraux: num [1:3] 1.13 1.12 1.4
#  ..$ pivot: int [1:3] 1 2 3
#  ..$ tol  : num 1e-07
#  ..$ rank : int 3
#  ..- attr(*, "class")= chr "qr"
# $ df.residual  : int 7

Note that the compact QR matrix qrfit$qr$qr is as large as model matrix X. It is created inside lm.fit, but on exit of lm.fit, it is copied. So in total, we will have 3 "copies" of X:

  • the original one in global environment;
  • the one copied into lm.fit, the overwritten by QR factorization;
  • the one returned by lm.fit.

In your case, X is 10 GB, so the memory costs associated with lm.fit alone is already 30 GB. Let alone other costs associated with lm.


On the other hand, let's have a look at

solve(crossprod(X), crossprod(X,y))

X takes 10 GB, but crossprod(X) is only a 25 * 25 matrix, and crossprod(X,y) is just a length-25 vector. They are so tiny compared with X, thus memory usage does not increase at all.

Maybe you are worried that a local copy of X will be made when crossprod is called? Not at all! Unlike lm.fit which performs both read and write to X, crossprod only reads X, so no copy is made. We can verify this with our toy matrix X by:

tracemem(X)
crossprod(X)

You will see no copying message!


If you want a short summary for all above, here it is:

  • memory costs for lm.fit(X, y) (or even .lm.fit(X, y)) is three times as large as that for solve(crossprod(X), crossprod(X,y));
  • Depending on how much larger the model matrix is than the model frame, memory costs for lm is 3 ~ 6 times as large as that for solve(crossprod(X), crossprod(X,y)). The lower bound 3 is never reached, while the upper bound 6 is reached when the model matrix is as same as the model frame. This is the case when there is no factor variables or "factor-alike" terms, like bs() and poly(), etc.


回答2:

In addition to what idris said, it's also worth pointing out that lm() does not solve for the parameters using the normal equations like you illustrated in your question, but rather uses QR decomposition, which is less efficient but tends to produce more numerically accurate solutions.



回答3:

You might want to consider using the biglm package. It fits lm models by using smaller chunks of data.



回答4:

lm does much more than just find the coefficients for your input features. For example, it provides diagnostic statistics that tell you more about the coefficients on your independent variables including the standard error and t value of each of your independent variables.

I think that understanding these diagnostic statistics is important when running regressions to understand how valid your regression is.

These additional calculations cause lm to be slower than simply doing solving the matrix equations for the regression.

For example, using the mtcars dataset:

>data(mtcars)
>lm_cars <- lm(mpg~., data=mtcars)
>summary(lm_cars)

Call:                                                         
lm(formula = mpg ~ ., data = mtcars)                          

Residuals:                                                    
    Min      1Q  Median      3Q     Max                       
-3.4506 -1.6044 -0.1196  1.2193  4.6271                       

Coefficients:                                                 
            Estimate Std. Error t value Pr(>|t|)              
(Intercept) 12.30337   18.71788   0.657   0.5181              
cyl         -0.11144    1.04502  -0.107   0.9161              
disp         0.01334    0.01786   0.747   0.4635              
hp          -0.02148    0.02177  -0.987   0.3350              
drat         0.78711    1.63537   0.481   0.6353              
wt          -3.71530    1.89441  -1.961   0.0633 .            
qsec         0.82104    0.73084   1.123   0.2739              
vs           0.31776    2.10451   0.151   0.8814              
am           2.52023    2.05665   1.225   0.2340              
gear         0.65541    1.49326   0.439   0.6652              
carb        -0.19942    0.82875  -0.241   0.8122              
---                                                           
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom        
Multiple R-squared: 0.869,      Adjusted R-squared: 0.8066    
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07       


回答5:

To elaborate on Jake's point. Let's say your regression is trying to solve: y = Ax ( A is the design matrix ). With m observations and n independent variables A is a mxn matrix. Then cost of QR is ~ m*n^2. In your case it looks like m = 14x10^6 and n = 20 . So m*n^2 = 14*10^6*400 is a significant cost.

However with the normal equations you are trying to invert A'A (' indicates transpose ), which is square and of size nxn. The solve is usually done using LU which costs n^3 = 8000. This is much smaller than the computational cost of QR. Of course this doesn't include the cost of the matrix multiply.

Further if the QR routine tries to store the Q matrix which is of size mxm=14^2*10^12 (!), then your memory will be insufficient. It is possible to write QR to not have this problem. It would be interesting to know what version of QR, is actually being used. And why exactly the lm call runs out of memory.