-->

zeroinfl “system is computationally singular” wher

Posted 2019-07-07 17:32

Question:

I am trying to model count data: the number of absence days per worker in a year (the dependent variable). I have a set of predictors covering the workers, their jobs, and so on, most of them categorical. Consequently, there is a large number of coefficients to estimate (83), but since I have more than 600,000 rows, I think this should not be a problem. In addition, there are no missing values in my dataset.

My dependent variable contains a lot of zeros, so I would like to estimate a zero-inflated model (Poisson or negative binomial) with the zeroinfl function from the pscl package, using the code:

zpoisson <- zeroinfl(formule, data = train, dist = "poisson", link = "logit")

but I get the following error after a long running time:

Error in solve.default(as.matrix(fit$hessian)) : system is computationally singular: reciprocal condition number = 1.67826e-41

I think this error means that some of my covariates are correlated, but that does not seem to be the case when I check pairwise correlations and the Variance Inflation Factor (VIF). Moreover, I have also estimated other models, such as logit, Poisson, and negative binomial count models, without problems, even though these models are also sensitive to correlated predictors.

Do you have any idea why the zeroinfl function does not work? Could it be linked to the fact that I have too many predictors, even if they are not correlated? I have already tried to remove some predictors with the Boruta algorithm, but it kept all of them.

Thanks in advance for your help.

Answer 1:

  1. Collinearity among the regressors is one potential cause of this error, but there are others.
  2. The problem may actually be computational, in the sense that the regressors are badly scaled. Some regressor might take values in the thousands or millions and have a tiny coefficient, while other regressors take small values and have huge coefficients. This leads to a numerically unstable Hessian matrix and to the error above when it is inverted. Typical causes include squared regressors x^2 when x itself is already large. Simply using x/1000 or so might solve the problem.
  3. The problem may also be separation or a lack of variation in the response. For example, if there are only zeros for certain groups or factor levels, the corresponding coefficient estimates might diverge and have huge standard errors, much like (quasi-)complete separation in binary regression.
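
The three points above can each be checked with a few lines of base R before refitting. What follows is only a rough sketch on made-up data: the columns absence, salary and dept are hypothetical stand-ins, and only the names train and formule come from the question. It checks the rank of the design matrix, compares its condition number before and after rescaling a large regressor, and looks for factor levels whose response is all zeros.

```r
# Hypothetical toy data standing in for the asker's `train` data frame
set.seed(1)
n <- 300
train <- data.frame(
  absence = rpois(n, 2),
  salary  = runif(n, 20000, 90000),   # regressor with large values
  dept    = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
train$absence[train$dept == "C"] <- 0 # a factor level with only zeros
formule <- absence ~ salary + I(salary^2) + dept

# 1. Collinearity: the design matrix should have full column rank
X <- model.matrix(formule, data = train)
full_rank <- qr(X)$rank == ncol(X)
print(full_rank)

# 2. Scaling: compare condition numbers before and after using x / 1000
train$salary_k <- train$salary / 1000
X_scaled <- model.matrix(absence ~ salary_k + I(salary_k^2) + dept,
                         data = train)
print(c(raw = kappa(X), rescaled = kappa(X_scaled)))

# 3. Lack of variation: share of zero responses within each factor level
factor_cols <- names(train)[sapply(train, is.factor)]
zero_share <- lapply(train[factor_cols],
                     function(f) tapply(train$absence == 0, f, mean))
print(zero_share)
```

A zero share of exactly 1 (or exactly 0) for some level is the kind of pattern that point 3 warns about. With 83 coefficients, it may be worth running the third check on every categorical predictor and merging or dropping the affected levels before calling zeroinfl again.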