lme4::glmer vs. Stata's melogit command

Posted 2020-06-16 05:28

Question:

Lately I have been trying to fit a lot of random effects models to relatively big datasets. Let’s say about 50,000 people (or more) observed at up to 25 time points. With such a large sample size, we include a lot of predictors that we’re adjusting for – maybe 50 or so fixed effects. I’m fitting the model to a binary outcome using lme4::glmer in R, with random intercepts for each subject. I can't go into specifics on the data, but the basic format of the glmer command I used was:

fit <-  glmer(outcome ~ treatment + study_quarter + dd_quarter + (1|id),
              family = "binomial", data = dat)

where both study_quarter and dd_quarter are factors with about 20 levels each.

When I try to fit this model in R, it runs for about 12-15 hours, and returns an error that it failed to converge. I did a bunch of troubleshooting (e.g., following these guidelines), with no improvement. And the convergence isn’t even close in the end (max gradient around 5-10, whereas the convergence criterion is 0.001 I think).
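For reference, one standard adjustment of that kind is switching the nonlinear optimizer and raising its evaluation limit via glmerControl; the specific choices below (bobyqa, maxfun) are just an illustration of the kind of thing I tried, not something that fixed it:

# Illustrative convergence-troubleshooting attempt: use a different
# nonlinear optimizer and allow more function evaluations.
fit <- glmer(outcome ~ treatment + study_quarter + dd_quarter + (1|id),
             family = "binomial", data = dat,
             control = glmerControl(optimizer = "bobyqa",
                                    optCtrl = list(maxfun = 2e5)))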

I then tried fitting the model in Stata, using the melogit command. The model fit in under 2 mins, with no convergence issues. The corresponding Stata command is

melogit outcome treatment i.study_quarter i.dd_quarter || id:

What gives? Does Stata just have a better fitting algorithm, or one better optimized for large models and large datasets? It’s really surprising how different the run times were.

Answer 1:

The glmer fit will probably be much faster with the optional argument nAGQ=0L. You have many fixed-effects parameters (20 levels for each of study_quarter and dd_quarter generate 19 contrasts apiece, 38 in total) and the default optimization method (corresponding to nAGQ=1L) puts all of those coefficients into the general nonlinear optimization call. With nAGQ=0L these coefficients are optimized within the much faster penalized iteratively reweighted least squares (PIRLS) algorithm. The default will generally provide a better estimate in the sense that the deviance at the estimate is lower, but the difference is usually very small and the time difference is enormous.
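As a minimal sketch (reusing the formula from the question; fit0 is just an illustrative name), the faster call is the original one with nAGQ = 0L added:

fit0 <- glmer(outcome ~ treatment + study_quarter + dd_quarter + (1|id),
              family = "binomial", data = dat, nAGQ = 0L)

# With nAGQ = 0L the fixed-effects coefficients are estimated inside PIRLS,
# so the outer nonlinear optimization is over the variance parameter only.
# If you also obtain the default (nAGQ = 1L) fit, comparing deviance() of
# the two fits shows how small the gap in fit quality usually is.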

I have a write-up of the differences in these algorithms as a Jupyter notebook nAGQ.ipynb. That writeup uses the MixedModels package for Julia instead of lme4 but the methods are similar. (I am one of the authors of lme4 and the author of MixedModels.)

If you are going to be doing a lot of GLMM fits I would consider doing so in Julia with MixedModels. It is often much faster than R, even with all the complicated code in lme4.



Answer 2:

Are you sure that Stata is reading in the whole file?

http://www.stata.com/manuals13/rlimits.pdf

The reason I ask is that it sounds like you've got 50k people observed 25 times (1,250,000 rows); depending on the version of Stata you're using, you could be getting truncated results.

EDIT Since it's not a file-length issue, have you tried other mixed-effects packages, such as nlme? I suspect the nonlinear mixed-effects fitting there might handle your data somewhat faster.

EDIT This resource on choosing among the different R mixed-model approaches may be more helpful than anything above: https://stats.stackexchange.com/questions/173813/r-mixed-models-lme-lmer-or-both-which-one-is-relevant-for-my-data-and-why