I ran logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept, and predicted scores).
# Python code (imports used below)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
In[23]: iris_df.head(5)
Out[23]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
In[35]: iris_df.shape
Out[35]: (100, 5)
#looking at the levels of the Species dependent variable..
In[25]: iris_df['Species'].unique()
Out[25]: array([0, 1], dtype=int64)
#creating dependent and independent variable datasets..
x = iris_df.iloc[:, 0:4]   # .ix is deprecated; use .iloc for positional indexing
y = iris_df.iloc[:, -1]
#modelling starts..
y = np.ravel(y)
logistic = LogisticRegression()
model = logistic.fit(x, y)
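For context, scikit-learn's `LogisticRegression` applies L2 regularization by default (`C=1.0`), whereas R's `glm` fits an unpenalized maximum-likelihood model, so their coefficients are not expected to match. A minimal sketch of the effect, using sklearn's bundled copy of iris restricted to the first two species as a stand-in for the 100-row frame above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Binary subset of iris (first two species), mirroring the 100-row frame above.
iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]

# Default fit: L2 penalty with C=1.0 shrinks the coefficients.
default_model = LogisticRegression().fit(X, y)

# Very weak regularization (large C) lets the coefficients grow toward the
# unpenalized maximum-likelihood solution that R's glm tries to find.
weak_reg_model = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)

print(default_model.coef_)
print(weak_reg_model.coef_)
```

The weakly regularized coefficients come out much larger in magnitude, moving in the direction of the huge estimates R reports.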
#getting the model coefficients..
model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
model_intercept = model.intercept_
In[30]: model_coef
Out[36]:
0 1
0 Sepal.Length [-0.402473917528]
1 Sepal.Width [-1.46382924771]
2 Petal.Length [2.23785647964]
3 Petal.Width [1.0000929404]
In[31]: model_intercept
Out[31]: array([-0.25906453])
#scores...
In[34]: logistic.predict_proba(x)
Out[34]:
array([[ 0.9837306 , 0.0162694 ],
[ 0.96407227, 0.03592773],
[ 0.97647105, 0.02352895],
[ 0.95654126, 0.04345874],
[ 0.98534488, 0.01465512],
[ 0.98086592, 0.01913408],
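As a sanity check on the Python side, the `predict_proba` scores can be reproduced by hand from `coef_` and `intercept_` with the logistic (sigmoid) function; this sketch again uses sklearn's bundled two-class iris subset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]

model = LogisticRegression().fit(X, y)

# Linear predictor z = Xw + b, then the sigmoid gives P(class 1).
z = X @ model.coef_.ravel() + model.intercept_[0]
p1 = 1.0 / (1.0 + np.exp(-z))

# p1 matches the second column of predict_proba.
print(np.allclose(p1, model.predict_proba(X)[:, 1]))
```

So the Python scores are internally consistent with the Python coefficients; the disagreement with R comes from the fitted coefficients themselves, not the scoring step.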
R code:
> str(irisdf)
'data.frame': 100 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : int 0 0 0 0 0 0 0 0 0 0 ...
> model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.681e-05 -2.110e-08 0.000e+00 2.110e-08 2.006e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.556 601950.324 0 1
Sepal.Length -9.879 194223.245 0 1
Sepal.Width -7.418 92924.451 0 1
Petal.Length 19.054 144515.981 0 1
Petal.Width 25.033 216058.936 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 1.3166e-09 on 95 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
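The `glm.fit` warnings point to perfect separation: in this two-class iris subset, petal length alone cleanly splits the species, so the unpenalized likelihood has no finite maximum and the coefficients diverge. A quick check (using sklearn's built-in copy of the dataset as a stand-in):

```python
from sklearn.datasets import load_iris

iris = load_iris()
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]

# Petal length is column 2; the two classes do not overlap on it,
# so a separating hyperplane exists and MLE coefficients blow up.
setosa_max = X[y == 0, 2].max()
versicolor_min = X[y == 1, 2].min()
print(setosa_max, versicolor_min)
```

Because the classes are separable, R's huge coefficients and near-zero residual deviance are the expected (degenerate) MLE behavior, not a bug.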
Because of the convergence problem, I increased the maximum number of iterations and loosened the convergence tolerance (epsilon = 0.01).
> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf,
control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.0102793 -0.0005659 -0.0000052 0.0001438 0.0112531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.796 704.352 0.003 0.998
Sepal.Length -3.426 215.912 -0.016 0.987
Sepal.Width -4.208 123.513 -0.034 0.973
Petal.Length 7.615 159.478 0.048 0.962
Petal.Width 11.835 285.938 0.041 0.967
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 5.3910e-04 on 95 degrees of freedom
AIC: 10.001
Number of Fisher Scoring iterations: 12
#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
1 2 3 4 5
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08
The scores, intercept, and coefficients are completely different between R and Python. Which one is correct? I want to proceed in Python, but I am now unsure which results are accurate.