Does Quasi Separation matter in R binomial GLM?

Posted 2019-03-03 08:56

Question:

I am learning how quasi-separation affects a binomial GLM in R, and I am starting to think that it does not matter in some circumstances.

As I understand it, data have quasi-separation when some linear combination of factor levels completely identifies failure/non-failure.

So I created an artificial dataset with quasi-separation in R:

fail <- c(100,100,100,100)
nofail <- c(100,100,0,100)
x1 <- c(1,0,1,0)
x2 <- c(0,0,1,1)
data <- data.frame(fail,nofail,x1,x2)
rownames(data) <- paste("obs",1:4)

Then when x1=1 and x2=1 (obs 3) the data always fails. For this data, my covariate matrix has three columns: intercept, x1 and x2.
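
As a quick sanity check, the observed failure proportion in each row can be computed directly from the counts. This is only a sketch that rebuilds the data frame above; the name prop is introduced here for illustration and is not part of the original post.

```r
# Rebuild the data frame from the post
fail   <- c(100, 100, 100, 100)
nofail <- c(100, 100, 0, 100)
x1 <- c(1, 0, 1, 0)
x2 <- c(0, 0, 1, 1)
data <- data.frame(fail, nofail, x1, x2)
rownames(data) <- paste("obs", 1:4)

# Observed proportion of failures per row (prop is an illustrative name)
prop <- with(data, fail / (fail + nofail))
names(prop) <- rownames(data)
print(prop)
```

Only obs 3 (x1=1, x2=1) has an observed proportion of exactly 1; the other three rows sit at 0.5.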

As I understand it, quasi-separation results in an infinite coefficient estimate, so the glm fit should fail. However, the following glm fit does NOT fail:

summary(glm(cbind(fail,nofail)~x1+x2,data=data,family=binomial))

The result is:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.4342     0.1318  -3.294 0.000986 ***
x1            0.8684     0.1660   5.231 1.69e-07 ***
x2            0.8684     0.1660   5.231 1.69e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Std. Error seems very reasonable even with the quasi-separation. Could anyone tell me why the quasi-separation is NOT affecting the glm fit result?

Answer 1:

You have constructed an interesting example but you are not testing a model that actually examines the situation that you are describing as quasi-separation. When you say: "when x1=1 and x2=1 (obs 3) the data always fails.", you are implying the need for an interaction term in the model. Notice that this produces a "more interesting" result:

> summary(glm(cbind(fail,nofail)~x1*x2,data=data,family=binomial))

Call:
glm(formula = cbind(fail, nofail) ~ x1 * x2, family = binomial, 
    data = data)

Deviance Residuals: 
[1]  0  0  0  0

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.367e-17  1.414e-01   0.000        1
x1           2.675e-17  2.000e-01   0.000        1
x2           2.965e-17  2.000e-01   0.000        1
x1:x2        2.731e+01  5.169e+04   0.001        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.2429e+02  on 3  degrees of freedom
Residual deviance: 2.7538e-10  on 0  degrees of freedom
AIC: 25.257

Number of Fisher Scoring iterations: 22
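
One way to see why the two fits behave so differently is to compare their fitted probabilities. The sketch below assumes the data frame from the question is used unchanged; the object names fit_add, fit_int and comparison are chosen here for illustration. The additive model has only three parameters for four cells, so it cannot push the fitted probability for obs 3 towards 1 while also keeping obs 1, obs 2 and obs 4 near 0.5. The interaction model is saturated and can.

```r
# Data frame from the question
fail   <- c(100, 100, 100, 100)
nofail <- c(100, 100, 0, 100)
x1 <- c(1, 0, 1, 0)
x2 <- c(0, 0, 1, 1)
data <- data.frame(fail, nofail, x1, x2)
rownames(data) <- paste("obs", 1:4)

# Additive model (the question) and interaction model (this answer)
fit_add <- glm(cbind(fail, nofail) ~ x1 + x2, data = data, family = binomial)
fit_int <- suppressWarnings(
  glm(cbind(fail, nofail) ~ x1 * x2, data = data, family = binomial)
)

# Observed proportions next to the fitted probabilities of each model
comparison <- data.frame(
  observed    = with(data, fail / (fail + nofail)),
  additive    = unname(fitted(fit_add)),
  interaction = unname(fitted(fit_int)),
  row.names   = rownames(data)
)
print(round(comparison, 4))
```

In the additive fit the probability for obs 3 stays well below 1, which is why its coefficients and standard errors look ordinary. In the interaction fit it is pushed to essentially 1, and that is where the runaway x1:x2 estimate comes from.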

One generally needs to be very suspicious of beta coefficients as large as 2.731e+01. The implied odds ratio is:

 > exp(2.731e+01)
[1] 725407933166

In this working environment there really is no material difference between Inf and 725,407,933,166.
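
To put that number on the probability scale, here is a small sketch that plugs the reported estimates into the inverse logit. It assumes the rounded coefficients printed above (intercept and main effects essentially zero, interaction 27.31); the variable names are illustrative only.

```r
# Linear predictor for obs 3 (x1 = 1, x2 = 1) from the rounded estimates:
# intercept + x1 + x2 + x1:x2, with the first three treated as zero
eta_obs3 <- 0 + 0 + 0 + 27.31

# Implied odds of failure for obs 3
odds_obs3 <- exp(eta_obs3)

# Implied probability of NOT failing, i.e. 1 / (1 + exp(eta_obs3))
p_nofail <- plogis(eta_obs3, lower.tail = FALSE)

print(odds_obs3)
print(p_nofail)
```

The implied non-failure probability is on the order of 1e-12, which cannot be told apart from zero with 100 trials. Together with the standard error of 5.169e+04 on x1:x2, this is the usual sign that the estimate is drifting towards infinity and only stopped because the fitting iterations ended.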