Fitting in nested cross-validation with cross_val_score

Published 2019-03-06 11:29

Question:

I am working in scikit-learn and trying to tune my XGBoost model. I attempted nested cross-validation, using a pipeline for the rescaling of the training folds (to avoid data leakage and overfitting), together with GridSearchCV for parameter tuning and cross_val_score to get the roc_auc score at the end.

from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier


std_scaling = StandardScaler()
algo = XGBClassifier()

steps = [('std_scaling', std_scaling), ('algo', algo)]

pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

# inner CV, used by GridSearchCV to pick hyperparameters
inner_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)

clf_auc = GridSearchCV(pipeline, cv=inner_cv, param_grid=parameters,
                       scoring='roc_auc', n_jobs=-1, return_train_score=False)

# outer CV, used by cross_val_score to score the whole selection process
outer_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=outer_cv,
                                scoring='roc_auc')

Question 1. How do I fit cross_val_score to the training data?

Question 2. Since I included the StandardScaler() in the pipeline, does it make sense to pass X_train to cross_val_score, or should I use a standardized form of X_train (i.e. std_X_train)?

std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)

Answer 1:

You chose the right way to avoid data leakage, as you say: nested CV.

The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existent "meta-estimator" that describes your model selection process as well.

Meaning: in every round of the outer cross-validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV, which selects the best model given the current fold of the external CV. Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.

For example, in one external CV fold the scored model can be one that selected the param algo__min_child_weight to be 1, and in another, a model that selected it to be 2.
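
You can actually watch this happen. Here is a minimal sketch (reusing clf_auc, outer_cv, X_train and y_train from the question) that swaps cross_val_score for cross_validate with return_estimator=True, so the estimator fitted in each outer fold can be inspected:

from sklearn.model_selection import cross_validate

# keep the GridSearchCV instance fitted in each outer fold
result = cross_validate(clf_auc, X_train, y_train, cv=outer_cv,
                        scoring='roc_auc', return_estimator=True)

# the winning hyperparameters will typically differ from fold to fold
for i, fitted_search in enumerate(result['estimator']):
    print("outer fold %d best params: %s" % (i, fitted_search.best_params_))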

The score of the external CV therefore represents a higher-level quantity: "under a reasonable model selection process, how well will my selected model generalize?"

Now, if you want to finish the process with a real model in hand, you have to select one in some way (cross_val_score will not do that for you).

The way to do that is to fit your internal estimator over the entire data, meaning to perform:

clf_auc.fit(X, y)

This is the moment to understand what you've done here:

  1. You have a model you can use, which is fitted over all the data available.
  2. When you're asked "how well does that model generalize on new data?", the answer is the score you got during the nested CV, which captured the model selection process as part of your model's scoring.
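
Putting both steps together, a minimal end-to-end sketch (assuming X and y hold all the data available for modelling, and clf_auc and outer_cv as defined earlier):

# 1. nested CV: estimate how well the whole selection process generalizes
nested_scores = cross_val_score(clf_auc, X, y, cv=outer_cv, scoring='roc_auc')
print("estimated generalization roc_auc: %.3f +/- %.3f"
      % (nested_scores.mean(), nested_scores.std()))

# 2. final model: run the inner grid search once over all the data;
#    with the default refit=True, GridSearchCV refits the best pipeline
clf_auc.fit(X, y)
print("chosen hyperparameters:", clf_auc.best_params_)

final_model = clf_auc.best_estimator_  # the fitted pipeline you can deploy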

And regarding Question 2: since the scaler is part of the pipeline, there is no reason to manipulate X_train externally; pass the raw X_train and let the pipeline do the rescaling.
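
Inside cross_val_score the pipeline refits the StandardScaler on each training fold and only transforms the corresponding held-out fold, which is exactly the per-fold version of the std_scaler snippet in your question. Pre-scaling X_train yourself would reintroduce the leakage the pipeline prevents. A small self-contained illustration (make_classification and LogisticRegression are stand-ins here, purely to keep the sketch runnable):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# correct: the scaler is fitted on each training fold only
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', LogisticRegression())])
ok_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc')

# leaky: the scaler has already seen every validation fold
X_leaky = StandardScaler().fit_transform(X_demo)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y_demo,
                               cv=5, scoring='roc_auc')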