I'm trying to get to grips with performing regression analyses in R. Below, I generate some random dummy data in R and fit a logistic GLM to it. I then save the data to a text file, read it into Python with IPython (the IPython notebook is awesome, by the way; I've only just started using it!), and try to run the same analysis in Python. The results are very similar, but not identical; I would have expected them to be the same. Have I done something wrong, is there a parameter I'm missing, or is the difference due to some underlying calculation?
Any help appreciated!
EDIT: I don't know if this warrants closing, but I re-ran things with Ben's edited (and cleaner) code and the results between Python and R are now the same. I didn't change the Python code at all, and my previous R code and Ben's code, while different, should (as far as I can see) be doing the same thing. Regardless, the point of the question is now moot. Nonetheless, thanks for taking a look.
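For anyone who wants to check the agreement numerically rather than by eye, here's a minimal sketch. It assumes the `res` object from the statsmodels fit further down, and hard-codes the coefficients from the R summary below, reordered to match the statsmodels output:

import numpy as np

# Coefficients from R's summary(fit1), in statsmodels' column
# order (a, b, c2..c5, then the intercept, since prepend=False):
r_coefs = np.array([-0.2051, -1.6103, -0.5114, -1.3169,
                    -2.0017, -2.5084, 1.2363])

# res.params holds the fitted statsmodels coefficients; the two
# vectors should agree to the precision printed by R (4 d.p.).
print(np.allclose(res.params, r_coefs, atol=5e-5))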
Generate random data and run glm:
set.seed(666)
dat <- data.frame(a=rnorm(500), b=runif(500),
                  c=as.factor(sample(1:5, 500, replace=TRUE)))
library(plyr)
dat <- mutate(dat,
              y0=((jitter(a)^2+(-log10(b)))/(as.numeric(c)/10))+rnorm(500),
              y=(y0>=mean(y0)))
fit1 <- glm(y~a+b+c, data=dat, family=binomial('logit'))
summary(fit1)
Call:
glm(formula = y ~ a + b + c, family = binomial("logit"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5369  -0.8154  -0.5479   0.9314   2.3831  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.2363     0.3007   4.111 3.94e-05 ***
a            -0.2051     0.1062  -1.931   0.0535 .  
b            -1.6103     0.3834  -4.200 2.67e-05 ***
c2           -0.5114     0.3091  -1.654   0.0980 .  
c3           -1.3169     0.3147  -4.184 2.86e-05 ***
c4           -2.0017     0.3342  -5.990 2.09e-09 ***
c5           -2.5084     0.3772  -6.651 2.92e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 631.31  on 499  degrees of freedom
Residual deviance: 537.84  on 493  degrees of freedom
AIC: 551.84

Number of Fisher Scoring iterations: 4
Output the same data for use in Python:
write.table(dat, file='test.txt', row.names=F, col.names=T, quote=F, sep=',' )
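(A quick sanity check on the export, before modelling: reading the file straight back shows the dtypes pandas will actually see. R's TRUE/FALSE comes through as a boolean column, and the factor c arrives as plain integers:)

import pandas as pd

# Confirm how pandas parses the exported file: 'a'/'b' should be
# float64, 'c' int64 (its factor levels are lost in the CSV), and
# 'y' bool (parsed from R's TRUE/FALSE).
chk = pd.read_csv('test.txt')
print(chk.dtypes)
print(chk.head())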
And now the Python code:
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('test.txt')
data.describe()

# Dummy-code the factor column; cast to float so statsmodels
# doesn't receive an object-dtype design matrix.
dummy_c = pd.get_dummies(data['c'], prefix='c').astype(float)

# Drop c_1 so it acts as the reference level, as in R.
data = data[['y', 'a', 'b']].join(dummy_c.loc[:, 'c_2':])

data_depend = data['y'].astype(int)  # response as 0/1
data_independ = data.iloc[:, 1:]     # all predictor columns
data_independ = sm.add_constant(data_independ, prepend=False)

glm_binom = sm.GLM(data_depend, data_independ, family=sm.families.Binomial())
res = glm_binom.fit()
print(res.summary())
which yields:
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  500
Model:                            GLM   Df Residuals:                      493
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -268.92
Date:                Sun, 27 Oct 2013   Deviance:                       537.84
Time:                        01:26:47   Pearson chi2:                     514.
No. Iterations:                     6
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
a             -0.2051      0.106     -1.931      0.054        -0.413     0.003
b             -1.6103      0.383     -4.200      0.000        -2.362    -0.859
c_2           -0.5114      0.309     -1.654      0.098        -1.117     0.094
c_3           -1.3169      0.315     -4.184      0.000        -1.934    -0.700
c_4           -2.0017      0.334     -5.990      0.000        -2.657    -1.347
c_5           -2.5084      0.377     -6.651      0.000        -3.248    -1.769
const          1.2363      0.301      4.111      0.000         0.647     1.826
==============================================================================
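As an aside: statsmodels also has a formula interface (via patsy) that does the dummy coding for you, much like R's factor handling, which removes one place where the two fits could drift apart. A minimal sketch, assuming the same test.txt as above:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Re-read the exported data so 'c' is still a single column.
raw = pd.read_csv('test.txt')
raw['y'] = raw['y'].astype(int)  # response as 0/1

# C(c) makes patsy treat c as categorical; the first level becomes
# the reference, matching R's default treatment contrasts.
fit = smf.glm('y ~ a + b + C(c)', data=raw,
              family=sm.families.Binomial()).fit()
print(fit.summary())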