
Nested cross validation with StratifiedShuffleSplit

Published 2019-04-15 06:42

Question:

I am working on a binary classification problem and would like to perform nested cross validation to assess the classification error. The reason I am doing nested CV is the small sample size (N_0 = 20, N_1 = 10), where N_0 and N_1 are the numbers of instances in class 0 and class 1, respectively.

My code is quite simple:

>> pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4,1,50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=5)
>> cross_val_score(grid_search, X, y, cv=5)

So far, so good. But if I want to change the CV scheme from random splitting to StratifiedShuffleSplit in both the outer and the inner CV loop, I run into a problem: how can I pass the class vector y, which the StratifiedShuffleSplit function requires?

Naively:

>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=StratifiedShuffleSplit(y_inner_loop, 5, test_size=0.5, random_state=0))
>> cross_val_score(grid_search, X, y, cv=StratifiedShuffleSplit(y, 5, test_size=0.5, random_state=0))

So the problem is: how do I specify y_inner_loop?

** My data set is slightly imbalanced (20/10), and I would like to preserve this class ratio in the splits used for training and assessing the model.

Answer 1:

I have since resolved this problem; it might be of interest to other ML novices. In scikit-learn 0.18, the cross-validation utilities moved to the sklearn.model_selection module and their API changed slightly. To make a long story short:

>> import numpy as np
>> from sklearn.pipeline import Pipeline
>> from sklearn.preprocessing import StandardScaler
>> from sklearn.linear_model import LogisticRegression
>> from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
>> sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)  # outer loop: error estimation
>> sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)  # inner loop: hyperparameter tuning
>> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': np.logspace(-4, 1, 50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=sss_inner)
>> cross_val_score(grid_search, X, y, cv=sss_outer)
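As a quick usage note (a sketch only: it assumes X and y are already defined as the feature matrix and the binary label vector, and the nested_scores name is just illustrative), the outer-loop scores returned by cross_val_score can be summarized in the usual way:

>> nested_scores = cross_val_score(grid_search, X, y, cv=sss_outer)
>> print("Nested CV f1: %0.3f +/- %0.3f" % (nested_scores.mean(), nested_scores.std()))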

UPD: in the newer version, we no longer need to pass the target vector ("y", which was my problem initially) to the splitter when constructing it; we only specify the desired number of splits, and y is supplied later when the splits are generated.
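To make that concrete, here is a minimal self-contained sketch (X_toy and y_toy are made-up toy data that just mimic my 20/10 class sizes): y is now passed to the split() method instead of the constructor, and each split keeps the 2:1 class ratio:

>> import numpy as np
>> from sklearn.model_selection import StratifiedShuffleSplit
>> X_toy = np.random.randn(30, 4)          # 30 samples with 4 dummy features
>> y_toy = np.array([0] * 20 + [1] * 10)   # 20 vs 10 instances, as in my data
>> sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
>> train_idx, test_idx = next(sss.split(X_toy, y_toy))   # y is given to split(), not to the constructor
>> print(np.bincount(y_toy[train_idx]), np.bincount(y_toy[test_idx]))   # both show the preserved 2:1 ratio

With test_size=0.5 on 30 samples, both the training and the test part contain 10 instances of class 0 and 5 of class 1, which is exactly the ratio I wanted to preserve.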