I was running logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept, and scores).
# Python code:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
In[23]: iris_df.head(5)
Out[23]:
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
In[35]: iris_df.shape
Out[35]: (100, 5)
# Looking at the levels of the Species dependent variable:
In[25]: iris_df['Species'].unique()
Out[25]: array([0, 1], dtype=int64)
# Creating the dependent and independent variable datasets
# (positional indexing with .iloc; the old .ix indexer has been removed from pandas)
x = iris_df.iloc[:, 0:4]
y = iris_df.iloc[:, -1]
# Modelling starts
y = np.ravel(y)
logistic = LogisticRegression()
model = logistic.fit(x, y)
# Getting the model coefficients
model_coef = pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
model_intercept = model.intercept_
In[30]: model_coef
Out[30]:
              0                  1
0  Sepal.Length  [-0.402473917528]
1   Sepal.Width   [-1.46382924771]
2  Petal.Length    [2.23785647964]
3   Petal.Width     [1.0000929404]
In[31]: model_intercept
Out[31]: array([-0.25906453])
#scores...
In[34]: logistic.predict_proba(x)
Out[34]:
array([[ 0.9837306 ,  0.0162694 ],
       [ 0.96407227,  0.03592773],
       [ 0.97647105,  0.02352895],
       [ 0.95654126,  0.04345874],
       [ 0.98534488,  0.01465512],
       [ 0.98086592,  0.01913408],
       ...])
# R code:
> str(irisdf)
'data.frame': 100 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : int 0 0 0 0 0 0 0 0 0 0 ...
> model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.681e-05  -2.110e-08   0.000e+00   2.110e-08   2.006e-05
Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)      6.556  601950.324        0         1
Sepal.Length    -9.879  194223.245        0         1
Sepal.Width     -7.418   92924.451        0         1
Petal.Length    19.054  144515.981        0         1
Petal.Width     25.033  216058.936        0         1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 1.3166e-09 on 95 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
Due to the convergence problem, I increased the maximum number of iterations and set epsilon to 0.01.
> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf,
control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))
Deviance Residuals:
       Min          1Q      Median          3Q        Max
-0.0102793  -0.0005659  -0.0000052   0.0001438  0.0112531
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     1.796     704.352    0.003     0.998
Sepal.Length   -3.426     215.912   -0.016     0.987
Sepal.Width    -4.208     123.513   -0.034     0.973
Petal.Length    7.615     159.478    0.048     0.962
Petal.Width    11.835     285.938    0.041     0.967
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 5.3910e-04 on 95 degrees of freedom
AIC: 10.001
Number of Fisher Scoring iterations: 12
# R scores:
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
1 2 3 4 5
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08
The scores, intercept, and coefficients are completely different in R and Python. Which one is correct? I want to proceed in Python, but I am now confused about which results are accurate.
UPDATED: The problem is that there is perfect separation along the petal width variable. In other words, this variable can be used to perfectly predict whether a sample in the given dataset is setosa or versicolor. This breaks the log-likelihood maximization used to estimate logistic regression in R: the log-likelihood can be driven arbitrarily high by taking the coefficient of petal width to infinity.
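A quick way to verify the separation (a sketch run against the iris_df from the question):

# With the standard iris data, setosa's Petal.Width ranges over 0.1-0.6
# while versicolor's ranges over 1.0-1.8, so the two classes never overlap.
iris_df.groupby('Species')['Petal.Width'].agg(['min', 'max'])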
Some background and strategies are discussed here.
There is also a good thread on CrossValidated discussing strategies.
So why does the sklearn LogisticRegression work? Because by default it fits regularized logistic regression: the penalty discourages large parameter estimates, so the coefficients stay finite even under perfect separation.
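To see the effect of the penalty, here is a minimal sketch (it loads sklearn's built-in copy of the iris data rather than the DataFrame above): as C, the inverse regularization strength, grows, the penalty vanishes and the fitted coefficients drift toward the divergent unpenalized solution that R's glm was chasing.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2                     # keep setosa (0) and versicolor (1) only
X, y = iris.data[mask], iris.target[mask]

# sklearn's default is C=1.0; larger C means weaker regularization,
# so the coefficient magnitudes keep growing on this separable data.
for C in (1.0, 1e4, 1e8):
    coef = LogisticRegression(C=C, max_iter=10000).fit(X, y).coef_
    print(C, np.round(coef, 3))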
In the example below, I use Firth's bias-reduced logistic regression, via the logistf package, to produce a converged model.
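A minimal sketch of that example (assuming the logistf package is installed and irisdf is the 100-row data frame shown above):

> # Firth's penalized likelihood keeps the estimates finite despite the separation
> library(logistf)
> model <- logistf(Species ~ ., data = irisdf)
> summary(model)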
ORIGINAL: Based on the std. errors and z-values in the R solution, I think you have a bad model specification. A z-value close to 0 essentially tells you there is no correlation between the model and the dependent variable, so this is a nonsensical model.
My first thought is that you need to transform that Species field into a categorical variable; it is an int type in your example. Try using as.factor, e.g. irisdf$Species <- as.factor(irisdf$Species). See: How to convert integer into categorical data in R?