I'm using GridSearchCV to identify the best set of parameters for a random forest classifier.
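# sklearn < 0.20; later versions moved GridSearchCV to sklearn.model_selection
from sklearn import grid_search
from sklearn.ensemble import RandomForestClassifier

PARAMS = {
    'max_depth': [8, None],
    'n_estimators': [500, 1000],
}
rf = RandomForestClassifier()
clf = grid_search.GridSearchCV(estimator=rf, param_grid=PARAMS, scoring='roc_auc', cv=5, n_jobs=4)
clf.fit(data, labels)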
where data and labels are respectively the full dataset and the corresponding labels.
Now, I compared the performance returned by GridSearchCV (from clf.grid_scores_) with a "manual" AUC estimation:
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

aucs = []
for fold in range(n_folds):
    # train_file_fold / test_file_fold: the pre-split files for this fold
    train_data, train_labels = read_data(train_file_fold)
    test_data, test_labels = read_data(test_file_fold)
    clf = RandomForestClassifier(n_estimators=1000, max_depth=8)
    clf = clf.fit(train_data, train_labels)
    predicted_probs = clf.predict_proba(test_data)
    # probability of the second class (column index 1) for each test sample
    probabilities = [value[1] for value in predicted_probs]
    fpr, tpr, thresholds = metrics.roc_curve(test_labels, probabilities, pos_label=1)
    fold_auc = metrics.auc(fpr, tpr)
    aucs.append(fold_auc)
performance = np.mean(aucs)
where I manually pre-split the data into training and test sets (the same 5-fold CV approach).
The AUC values returned by GridSearchCV are always higher than the ones calculated manually (e.g. 0.70 from GridSearchCV vs. 0.62 manually) when using the same parameters for RandomForest.
I know that different training and test splits can give different performance, but this occurred consistently across 100 repetitions of the GridSearchCV. Interestingly, if I use accuracy instead of roc_auc as the scoring metric, the difference in performance is minimal and can be attributed to the fact that I use different training and test sets. Is this happening because the AUC value of GridSearchCV is estimated in a different way than by using metrics.roc_curve?
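One way to narrow this down is to run both calculations on exactly the same folds, so that any remaining gap can only come from the metric. The following is a minimal sketch of that check, not my actual setup. It assumes the newer sklearn.model_selection API rather than the grid_search module above, uses synthetic data from make_classification in place of my dataset, and uses a smaller forest to keep it fast. The fixed random_state values are also assumptions, added so that both paths fit identical models.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data (assumption, not the dataset in question).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One splitter shared by both paths, so the folds are identical.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)

# Path 1: the built-in 'roc_auc' scorer, as GridSearchCV would use it.
scorer_aucs = cross_val_score(rf, X, y, scoring="roc_auc", cv=cv)

# Path 2: the manual roc_curve + auc calculation on the same folds.
manual_aucs = []
for train_idx, test_idx in cv.split(X, y):
    model = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[test_idx])[:, 1]  # probability of class 1
    fpr, tpr, _ = metrics.roc_curve(y[test_idx], probs, pos_label=1)
    manual_aucs.append(metrics.auc(fpr, tpr))

print("scorer mean AUC:", np.mean(scorer_aucs))
print("manual mean AUC:", np.mean(manual_aucs))
```

If the two means agree on shared folds, the difference I see would have to come from how the folds are built (for example, stratification or the pre-split files) rather than from how the AUC is computed.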