Large fixed effects binomial regression in R

Published 2020-05-19 07:32

Question:

I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed effect variables. Fixed effect variable A has 3233 levels, variable B has 2326 levels, and variable C has 811 levels, so all in all I have 6370 fixed effects. The data is cross-sectional. I can't run this regression using the normal glm function because the design matrix seems too large for my memory (I get the message "Error: cannot allocate vector of size 22.9 Gb"). I am looking for alternative ways to run this regression on my MacBook Air (OS X 10.9.5, 8 GB RAM). I also have access to a server with 16 GB RAM.
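For reference, the call that fails looks roughly like this (the variable names are placeholders for my actual columns):

# y = binary outcome, x = covariate, A/B/C = the three fixed-effect factors (placeholders)
fit <- glm(y ~ x + factor(A) + factor(B) + factor(C),
           family = binomial, data = df)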

I have tried solving the issue in a few different ways but so far none led to satisfactory results:

lfe/felm: Using the felm regression function from the lfe package, which subtracts the fixed effects before running the regression. This works perfectly and allowed me to run the above regression as a normal linear model in just a few minutes. However, lfe does not support logistic regressions or other GLMs, so felm was great for getting an idea about model fit for different specifications but doesn't work for the final logistic regression models.
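The felm call looked roughly like this (placeholder names again); everything after the | is swept out as fixed effects rather than entering the design matrix as dummies:

library(lfe)

# Linear probability model; A, B and C must be factors
fit_lpm <- felm(y ~ x | A + B + C, data = df)
summary(fit_lpm)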

biglm/bigglm: I thought about using bigglm to break my data into more manageable chunks. However, several sources (e.g. link1, link2, link3) mention that for this to work, factor levels need to be consistent across chunks, i.e. each chunk must contain at least one observation of each level of each factor variable. Factors A and B contain levels that appear only once, so I can't split the data into chunks with consistent levels. If I delete 10 levels of fixed effect A and 8 levels of B (a minor change), I will only be left with levels that appear 4+ times, and splitting my data into 4 chunks would already make it a lot more manageable. However, I still need to figure out how to sort my data frame so that my 480,000 entries end up in 4 chunks in which every level of each of the 3 factors appears at least once.
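The best idea I have so far (a sketch, not tested at full scale) is to assign rows to chunks round-robin within each level of the largest factor, check that the other two factors are covered in every chunk, and then feed the chunks to bigglm through its data-function interface:

# A, B, C are assumed to be factors; assign each row to one of 4 chunks,
# cycling within each level of A so every chunk contains every level of A
df$chunk <- ave(seq_len(nrow(df)), df$A,
                FUN = function(i) (seq_along(i) - 1) %% 4 + 1)

# Check that every chunk also contains every level of B and of C;
# if not, permute rows within the levels of A and try again
sapply(split(df, df$chunk), function(d)
  nlevels(droplevels(d$B)) == nlevels(df$B) &&
  nlevels(droplevels(d$C)) == nlevels(df$C))

# bigglm can take a function that returns one chunk per call (NULL when done,
# rewinding when called with reset = TRUE)
library(biglm)
chunk_reader <- function(data) {
  i <- 0
  function(reset = FALSE) {
    if (reset) { i <<- 0; return(NULL) }
    i <<- i + 1
    if (i > max(data$chunk)) return(NULL)
    data[data$chunk == i, ]
  }
}
fit <- bigglm(y ~ x + A + B + C, data = chunk_reader(df), family = binomial())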

GlmmGS/glmmgs: The glmmgs function in the package of the same name performs a fixed-effects subtraction for logistic regressions, similar to the lfe package, using a "Gauss-Seidel" algorithm. Unfortunately, the package is no longer being developed. Being relatively new to R and having no deep background in statistics, I can't make sense of the output and have no idea how to transform it into the usual "effect size", "model fit", and "significance interval" indicators that glm regression summaries provide.

I sent a message to the package's authors. They kindly responded as follows:

The package provides no output in the same format as a glm object. However, you can easily calculate most of the fit statistics (standard errors of the estimates, goodness of fit) given the current output (in the CRAN version, I believe the current output is a vector of coefficient estimates and the associated vector of standard errors; the same goes for the covariance components, but you need not worry about them if you are fitting a model without random effects). Only beware that the covariance matrices used to calculate the standard errors are the inverse of the diagonal blocks of the precision matrix associated with the Gauss-Seidel algorithm, and so they tend to underestimate the standard errors of the joint likelihood. I am not maintaining the package any longer and I do not have time to get into the specific details; the seminal theory behind the package can be found in the paper referenced in the manual, everything else needs to be worked out by you with pen and paper :).

If anyone can explain how to "easily calculate most of the fit statistics" in a way that someone without any education in statistics can understand (that might be impossible), or provide R code that shows an example of how this could be done, I would be much obliged!
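For what it's worth, here is as far as I got on my own: assuming est is the vector of coefficient estimates and se the corresponding vector of standard errors returned by glmmgs (placeholder names), the usual Wald-type quantities would be something like the following, though I am not sure this is what the authors mean:

# est = coefficient estimates, se = their standard errors (placeholder names);
# note the warning above that these standard errors may be underestimated
z     <- est / se                       # Wald z-statistics
p     <- 2 * pnorm(-abs(z))             # two-sided p-values
lower <- est - qnorm(0.975) * se        # 95% Wald confidence interval
upper <- est + qnorm(0.975) * se
round(cbind(estimate = est, odds.ratio = exp(est), z = z, p = p, lower, upper), 4)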

Revolution Analytics: I installed Revolution Analytics Enterprise on a virtual machine running Windows 7 on my Mac. The program has a function called rxLogit that is optimized for large logistic regressions. Using rxLogit I get the error (Failed to allocate 326554568 bytes. Error in rxCall("RxLogit", params) : bad allocation), so that function also seems to run into memory issues. However, the software enables me to run my regression on a distributed computing cluster, so I could just "kill the problem" by purchasing computing time on a cluster with lots of memory. But I wonder whether Revolution Analytics provides any formulas or methods that I don't know of that would allow me to do some kind of lfe-like fixed-effects subtraction or a bigglm-like chunking operation that takes factor levels into account.

MatrixModels/glm4: One person suggested I use the glm4 function from the MatrixModels package with sparse = TRUE to speed up the calculation. If I run a glm4 regression with all fixed effects, I get an "Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed" error. If I run it with only fixed effect variable B, or with only A and C, the calculation works and returns a "glpModel" object. As with glmmGS, I have trouble turning that output into a form that makes sense to me, since the standard summary() method does not seem to work on it.

I would be happy for advice on any of the issues mentioned above, or for completely different approaches to running logistic regressions with multiple large fixed effects in R under memory constraints.

Answer 1:

Check out

glmmboot{glmmML}

http://cran.r-project.org/web/packages/glmmML/glmmML.pdf

There is also a nice document by Brostrom and Holmberg (http://cran.r-project.org/web/packages/eha/vignettes/glmmML.pdf)

Here is the example from their document:

# Simulated data: binary y, covariate x, and a grouping factor with 1000 levels (5 obs each)
dat <- data.frame(y = rbinom(5000, size = 1, prob = 0.5),
                  x = rnorm(5000),
                  group = rep(1:1000, each = 5))

# Standard glm: estimates all 1000 group dummies explicitly (slow)
fit1 <- glm(y ~ factor(group) + x, data = dat, family = binomial)

# glmmboot: same fixed group effects without building the full dummy design matrix
require(glmmML)
fit2 <- glmmboot(y ~ x, cluster = group, data = dat)

The computing time difference is "huge"!



Answer 2:

For posterity, I'd also like to recommend the package speedglm, which I have found useful when trying to perform logistic regression on large data sets. It seems to use about half as much memory and finishes a lot quicker than glm.
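The interface is essentially a drop-in for glm; a minimal sketch with the asker's placeholder variable names (y, x, and the three factors A, B, C):

library(speedglm)

# Placeholder names; family = binomial() gives the usual logit link
fit <- speedglm(y ~ x + A + B + C, data = df, family = binomial())
summary(fit)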



Answer 3:

I agree with whoever (@Ben Bolker, I guess?) suggested that you use the glm4 function from the MatrixModels package. Firstly, it solves your memory problem if you use the sparse argument: a dense design matrix with 480,000 rows and 6370 fixed effects would require 6371 * 480,000 * 8 = 24,464,640,000 bytes (roughly 24.5 GB). However, your design matrix will be very sparse (mostly zeros), so you can get away with a much smaller (in memory) design matrix if you use a sparse one. Secondly, you can exploit the sparsity for much faster estimation.
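To get a feel for the difference, here is a small illustration with simulated factors (nothing like your real sizes; sparse.model.matrix is in the Matrix package):

library(Matrix)

# Toy data: 10,000 rows and two factors with many levels
set.seed(1)
d <- data.frame(A = factor(sample(500, 1e4, replace = TRUE)),
                B = factor(sample(300, 1e4, replace = TRUE)))

X_dense  <- model.matrix(~ A + B, data = d)           # ordinary dense design matrix
X_sparse <- sparse.model.matrix(~ A + B, data = d)    # stores only the non-zero entries

print(object.size(X_dense), units = "Mb")
print(object.size(X_sparse), units = "Mb")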

As for options, a quick search shows that speedglm also has a sparse argument, although I have not tried it. The key thing with whatever method you end up with is that it should exploit the sparsity of your design matrix, both to reduce computation time and to reduce the memory requirements.

The error you get (Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed) is most likely because your design matrix is singular. In that case your problem does not have a unique solution, and some options are to merge some of the group levels, use penalization, or fit a random effects model.

You are right that there does not seem to be a summary method for the glpModel class. However, the slots seem to have obvious names, and it should not take you long to get, e.g., standard errors for your estimates, compute a variance estimate, and so on.
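I have not used glm4 on a model of your size myself, so treat the following as a rough sketch: the slot names (pred@coef, pred@X, resp@mu) are my reading of the glpModel class, and the standard-error formula is just the textbook (X'WX)^-1 for a logistic model, which is why the toy example checks everything against an ordinary glm fit:

library(MatrixModels)
library(Matrix)

# Toy data small enough to also fit with glm() as a reference
set.seed(1)
d <- data.frame(y = rbinom(2000, 1, 0.5),
                x = rnorm(2000),
                A = factor(sample(50, 2000, replace = TRUE)))

fit4 <- glm4(y ~ x + A, family = binomial(), data = d, sparse = TRUE)
fit0 <- glm(y ~ x + A, family = binomial(), data = d)

# Coefficients: assumed to live in the pred@coef slot, named after the model matrix columns
beta <- fit4@pred@coef
names(beta) <- colnames(fit4@pred@X)

# Textbook covariance for a logistic model: (X' W X)^-1 with W = mu * (1 - mu),
# evaluated at the fitted probabilities (dispersion = 1 for the binomial family)
X  <- fit4@pred@X
mu <- fit4@resp@mu
Xw <- Diagonal(x = sqrt(mu * (1 - mu))) %*% X
vc <- solve(crossprod(Xw))
se <- sqrt(diag(vc))

# Sanity check against glm on the same toy data
head(cbind(glm4 = beta, glm = coef(fit0)))
head(cbind(glm4.se = se, glm.se = sqrt(diag(vcov(fit0)))))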