I'm trying to get the best set of parameters for an SVR model.
I'd like to use GridSearchCV over different values of C.
However, from previous tests I noticed that the split into training and test sets highly influences the overall performance (r2 in this instance).
To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
QUICK SOLUTION:
Following the idea presented in the official scikit-learn documentation, a quick solution is:
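import numpy
from sklearn.model_selection import GridSearchCV, KFold

# svr, p_grid, X and y are assumed to be defined earlier.
NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    # A different random_state reshuffles the 5 folds on every trial.
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # best_score_ is only available after fitting
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))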
You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
This is called nested cross-validation. You can look at the example in the official documentation to point you in the right direction, and also have a look at my other answer here for a similar approach.
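As a rough sketch of the RepeatedKFold suggestion, something like the following should give a 10 x 5 CV inside a single GridSearchCV. The synthetic data from make_regression and the values in p_grid are placeholders I made up for illustration, not anything from the question; swap in your own X, y and grid.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

# Synthetic regression data stands in for the real dataset (assumption).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Illustrative grid over C; the values are placeholders.
p_grid = {"C": [0.1, 1, 10, 100]}

# 5 folds repeated 10 times gives 50 train/test splits per candidate C.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

search = GridSearchCV(estimator=SVR(), param_grid=p_grid, cv=cv, scoring="r2")
search.fit(X, y)

# best_score_ is the mean r2 over all 50 splits for the best candidate.
best_std = search.cv_results_["std_test_score"][search.best_index_]
print("Best C:", search.best_params_["C"])
print("Mean r2:", search.best_score_, "STD:", best_std)
```

Compared with the manual loop, every candidate C is averaged over the same 50 splits before the best one is picked, so you get one best_params_ instead of ten separate searches.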
You can adapt the steps to suit your needs:
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV():
1. Pass clf, X, y, outer_cv to cross_val_score.
2. X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
3. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
4. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
5. The grid search estimator is trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
6. The hyperparameters whose average score over the inner splits (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_ and fitted on all the data, i.e. X_outer_train.
7. clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
8. These steps are repeated for every split of outer_cv, and cross_val_score returns the array of scores; their mean is nested_score.
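The steps above can be sketched roughly as follows. The synthetic data, the values in p_grid and the random_state seeds are assumptions for illustration only; the names clf, inner_cv, outer_cv and nested_score follow the description.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic regression data stands in for the real dataset (assumption).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Illustrative grid over C; the values are placeholders.
p_grid = {"C": [0.1, 1, 10, 100]}

# inner_cv selects hyperparameters, outer_cv estimates generalisation.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The inner estimator: a grid search that refits the best C on X_outer_train.
clf = GridSearchCV(estimator=SVR(), param_grid=p_grid, cv=inner_cv, scoring="r2")

# cross_val_score performs the outer split, fits clf on each outer training
# part and scores the refitted best estimator on the held-out outer test part.
outer_scores = cross_val_score(clf, X, y, cv=outer_cv, scoring="r2")

nested_score = outer_scores.mean()
print("Outer r2 scores:", outer_scores)
print("Nested score:", nested_score)
```

If you want the 10 x 5 repetition on top of this, you could rerun it with different random_state values, or pass a RepeatedKFold as outer_cv.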