How to speed up GLM estimation?

Posted 2020-02-05 08:04

Question:

I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns.

I am trying to fit a logistic regression with approximately 1500 parameters.

R is using only 7% CPU and has 60+ GB of memory available, yet the fit is still taking a very long time.

Here is the code:

glm.1.2 <- glm(formula = Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) + ((X6 + I(X6^2)) * factor(X7))), 
  family = binomial(logit), data = df[1:150000,])

Any suggestions to speed this up by a significant amount?

Answer 1:

You could try the speedglm function from the speedglm package. I haven't used it on problems as large as the one you describe, but it should be easy to use and give you a nice speed bump, especially if you install an optimized BLAS library (as @Ben Bolker suggested in the comments).
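As a rough illustration of how close the two calls are, here is a minimal sketch on simulated data. The data frame sim and its columns are made up for this example, and the snippet only runs the speedglm part if the package is installed. It is not a benchmark, just a check that the two fits agree.

```r
# Simulated data (hypothetical; stands in for the real data frame)
set.seed(42)
n <- 5000
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(letters[1:5], n, replace = TRUE))
)
sim$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * sim$x1 - 0.4 * sim$x2))

# Reference fit with base R
fit_glm <- glm(y ~ x1 * x2 + g, family = binomial(), data = sim)

# Same formula and family, fitted with speedglm (only if it is installed)
if (requireNamespace("speedglm", quietly = TRUE)) {
  fit_fast <- speedglm::speedglm(y ~ x1 * x2 + g, data = sim,
                                 family = binomial())
  # The coefficient estimates should agree up to numerical tolerance
  print(all.equal(unname(coef(fit_glm)), unname(coef(fit_fast)),
                  tolerance = 1e-6))
}
```

On a real problem you would wrap each call in system.time() to see whether the switch pays off.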

I remember seeing a table benchmarking glm and speedglm with and without a BLAS library, but I can't seem to find it today. I remember that it convinced me that I wanted both BLAS and speedglm.



Answer 2:

Although this is a bit late, I can only endorse dickoa's suggestion to generate a sparse model matrix using the Matrix package and then feed it to the speedglm.wfit function. That works great ;-) This way, I was able to run a logistic regression on a 1e6 x 3500 model matrix in less than 3 minutes.
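A minimal sketch of that workflow, on a much smaller simulated problem. The data, the factor names f1 and f2, and the sizes are assumptions made up for illustration. The speedglm.wfit call is only run if speedglm is installed.

```r
library(Matrix)

# Simulated data with two many-level factors (hypothetical example data)
set.seed(7)
n <- 20000
sim <- data.frame(
  f1 = factor(sample(sprintf("a%02d", 1:50), n, replace = TRUE)),
  f2 = factor(sample(sprintf("b%02d", 1:40), n, replace = TRUE)),
  x  = rnorm(n)
)
y <- rbinom(n, 1, plogis(0.5 * sim$x))

# Dummy-coded factors are mostly zeros, so store the design matrix sparsely.
# The formula includes the intercept column.
X <- sparse.model.matrix(~ f1 + f2 + x, data = sim)
dim(X)

# Fit directly from the response vector and the sparse design matrix
if (requireNamespace("speedglm", quietly = TRUE)) {
  fit <- speedglm::speedglm.wfit(y = y, X = X, family = binomial(),
                                 sparse = TRUE)
  print(head(fit$coefficients))
}
```

This only helps when the design matrix really is sparse, as with factors and their interactions. Continuous terms such as log(X2) produce dense columns.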



Answer 3:

Assuming that your design matrix is not sparse, you can also consider my package parglm. See this vignette for a comparison of computation times and further details. I also show a comparison of computation times here, in an answer to a related question.
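For orientation, a minimal usage sketch on simulated data. The data frame sim is made up for this example, and nthreads = 2L is an arbitrary choice. The call mirrors glm, with the number of threads passed through parglm.control. The snippet only runs if parglm is installed.

```r
# Simulated data (hypothetical; stands in for the real data frame)
set.seed(3)
n <- 10000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.3 + sim$x1 - 0.5 * sim$x2))

if (requireNamespace("parglm", quietly = TRUE)) {
  # Same formula/family/data interface as glm; threads are set via control
  fit_par <- parglm::parglm(
    y ~ x1 * x2, family = binomial(), data = sim,
    control = parglm::parglm.control(nthreads = 2L)
  )
  print(coef(fit_par))
}
```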

One of the methods in the parglm function works like the bam function in mgcv. The method is described in detail in

Wood, S.N., Goude, Y. & Shaw, S. (2015) Generalized additive models for large datasets. Journal of the Royal Statistical Society, Series C 64(1): 139-155.

One advantage of the method is that one can implement it with a non-concurrent QR implementation and still do the computation in parallel. Another advantage is a potentially lower memory footprint. This is used in mgcv's bam function and could also be implemented here with a setup like the one in speedglm's shglm function.
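To make the idea concrete, here is a base R sketch of a blockwise QR for an ordinary least-squares problem. It is an illustration of the general technique under simplifying assumptions, not the code used in parglm or mgcv: the data are simulated, the number of blocks is arbitrary, and every block is assumed to have full column rank. In a GLM the same step would be applied to the weighted design matrix and working response in each IRLS iteration.

```r
set.seed(1)
n <- 1000
p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
y <- drop(X %*% seq_len(p)) + rnorm(n)

# Split the rows into 4 blocks (arbitrary choice for this sketch)
blocks <- split(seq_len(n), rep(1:4, length.out = n))

# Factor each block on its own with the ordinary, single-threaded qr().
# The blocks are independent, so this lapply could be run in parallel
# (e.g. with parallel::mclapply). Only a p x p factor R and a length-p
# vector Q'y are kept per block, which keeps memory use low.
parts <- lapply(blocks, function(idx) {
  qr_b <- qr(X[idx, , drop = FALSE])
  list(R = qr.R(qr_b), f = qr.qty(qr_b, y[idx])[seq_len(p)])
})

# Stack the small pieces and solve the reduced least-squares problem
R_stack <- do.call(rbind, lapply(parts, `[[`, "R"))
f_stack <- unlist(lapply(parts, `[[`, "f"), use.names = FALSE)
beta_block <- qr.coef(qr(R_stack), f_stack)

# Compare with a single QR on the full matrix
beta_full <- qr.coef(qr(X), y)
print(all.equal(beta_block, beta_full))
```

The stacked problem has the same least-squares solution as the full one because, within each block, the residual sum of squares differs from that of the reduced system only by a constant that does not depend on the coefficients.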