I am using stratified 10-fold cross-validation to find the model that predicts y (a binary outcome) from X (which has 34 features) with the highest AUC. I set up the GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

log_reg = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2 penalties
parameter_grid = {'penalty': ['l1', 'l2'], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid,
                           scoring='roc_auc', cv=cross_validation)
And then do the cross-validation:
grid_search.fit(X, y)
y_pr = grid_search.predict(X)
I do not understand why

grid_search.score(X, y)

and

roc_auc_score(y, y_pr)

give different results (the former is 0.74 and the latter is 0.63). Why don't these commands do the same thing in my case?
This is due to how the roc_auc scorer is constructed when used inside GridSearchCV. Look at the source code here. Observe the third parameter, needs_threshold. When it is true, the scorer requires continuous values for y_pred, such as probabilities or confidence scores, which GridSearchCV obtains from log_reg.decision_function(). When you explicitly call roc_auc_score with y_pr, you are passing the output of .predict(), which is the predicted class labels, not probabilities or scores. That should account for the difference. Try scoring with the continuous outputs of decision_function() instead.
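To make the comparison concrete, here is a minimal sketch of the two ways of computing the AUC. The synthetic dataset (make_classification with 34 features) and the solver choice are assumptions for illustration; substitute your own X and y.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=500, n_features=34, random_state=100)

log_reg = LogisticRegression(solver='liblinear')  # liblinear supports l1 and l2
parameter_grid = {'penalty': ['l1', 'l2'], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid,
                           scoring='roc_auc', cv=cross_validation)
grid_search.fit(X, y)

# grid_search.score uses the 'roc_auc' scorer, which feeds the continuous
# decision_function() outputs of the best estimator to roc_auc_score...
auc_from_scores = roc_auc_score(y, grid_search.decision_function(X))

# ...while .predict() yields hard 0/1 labels, so roc_auc_score sees only
# two distinct "score" values and generally reports a different AUC.
auc_from_labels = roc_auc_score(y, grid_search.predict(X))

print(grid_search.score(X, y))  # matches auc_from_scores
print(auc_from_scores, auc_from_labels)
```

The first print matches auc_from_scores, which is the 0.74-style number; auc_from_labels corresponds to your 0.63-style number computed from y_pr.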
If you still get different results, please update the question with complete code and some sample data.