Difference in GLM results between iPython and R

2019-04-10 06:46发布

问题:

I'm trying to get to grips with performing regression analyses in R. Below is some random dummy data that I have generated in R, run a logistic glm in R. I have saved the data into a test file, read that into python with ipython (ipython notebook is awesome btw, only just started using it!), and then tried to run the same analyis with python. The results are very similar but they are different. I kind of would have expected them to be the same. Have I done something wrong, is there a parameter I am missing, or the difference due to some underlying calculation?

Any help appreciated!

EDIT: I don't know if this warrants closing, but I re -ran things with Ben's edited (and cleaner) code and the results between python and R are now the same. I didn't change the python code at all, and my previous R code and Ben's code, while different should be (as far as I can see) doing the same thing. Regardless the point of the question is now moot. Nonetheless, Thanks for taking a look.

Generate random data and run glm:

set.seed(666)
dat <- data.frame(a=rnorm(500), b=runif(500), 
                  c=as.factor(sample(1:5, 500, replace=TRUE)))
library(plyr)
dat <- mutate(dat,
              y0=((jitter(a)^2+(-log10(b)))/(as.numeric(c)/10))+rnorm(500),
              y=(y0>=mean(y0)))                  

fit1 <- glm(y~a+b+c,data=dat,family=binomial('logit'))
summary(fit1)

Call:
glm(formula = y ~ a + b + c, family = binomial("logit"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5369  -0.8154  -0.5479   0.9314   2.3831  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.2363     0.3007   4.111 3.94e-05 ***
a            -0.2051     0.1062  -1.931   0.0535 .  
b            -1.6103     0.3834  -4.200 2.67e-05 ***
c2           -0.5114     0.3091  -1.654   0.0980 .  
c3           -1.3169     0.3147  -4.184 2.86e-05 ***
c4           -2.0017     0.3342  -5.990 2.09e-09 ***
c5           -2.5084     0.3772  -6.651 2.92e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 631.31  on 499  degrees of freedom
Residual deviance: 537.84  on 493  degrees of freedom
AIC: 551.84

Number of Fisher Scoring iterations: 4

output the same data for use in python:

write.table(dat, file='test.txt', row.names=F, col.names=T, quote=F, sep=',' )

and now the python code:

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

data = pd.read_csv('test.txt')
data.describe()
dummy_c = pd.get_dummies(data['c'], prefix='c')
data = data[['y','a','b']].join(dummy_c.ix[:,'c_2':])
data_depend = data['y']
data_independ = data.iloc[:,1:]
data_independ = sm.add_constant(data_independ, prepend=False)
glm_binom = sm.GLM(data_depend, data_independ, family=sm.families.Binomial())
res = glm_binom.fit()
print res.summary()

which yields:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  500
Model:                            GLM   Df Residuals:                      493
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -268.92
Date:                Sun, 27 Oct 2013   Deviance:                       537.84
Time:                        01:26:47   Pearson chi2:                     514.
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
a             -0.2051      0.106     -1.931      0.054        -0.413     0.003
b             -1.6103      0.383     -4.200      0.000        -2.362    -0.859
c_2           -0.5114      0.309     -1.654      0.098        -1.117     0.094
c_3           -1.3169      0.315     -4.184      0.000        -1.934    -0.700
c_4           -2.0017      0.334     -5.990      0.000        -2.657    -1.347
c_5           -2.5084      0.377     -6.651      0.000        -3.248    -1.769
const          1.2363      0.301      4.111      0.000         0.647     1.826
==============================================================================

回答1:

A little bit of code cleanup:

set.seed(101)
dat <- data.frame(a=rnorm(500), b=runif(500), 
          c=as.factor(sample(1:5, 500, replace=TRUE)))
library(plyr)
dat <- mutate(dat,
         y0=((jitter(a)^2+(-log10(b)))/(as.numeric(c)/10))+rnorm(500),
         y=(y0>=mean(y0)))                  

fit1 <- glm(y~a+b+c,data=dat,family=binomial('logit'))
fit2 <- update(fit1,control=glm.control(maxit=6))
all.equal(fit1,fit2)

coef(fit1)
## (Intercept)           a           b          c2          c3          c4 
##  1.22283193 -0.07544488 -1.54732712 -0.36477556 -1.46313143 -1.95008291 
##          c5 
## -3.11914945

I agree with @Roland's comment that a reproducible example will help. The most likely difference is in contrast coding, e.g.:

fit3 <- update(fit1,contrasts=list(c=contr.sum))
coef(fit3)
## 
## (Intercept)           a           b          c1          c2          c3 
## -0.15659594 -0.07544488 -1.54732712  1.37942787  1.01465231 -0.08370356 
##          c4 
## -0.57065503 

If you use a model with only continuous predictors, do the results match better?

Update: contrast coding can't be the whole story because the deviance/log-likelihood as well as the coefficients differ.