Is there any built-in way to get scikit-learn to perform shuffled stratified k-fold cross-validation? This is one of the most common CV methods, and I am surprised I couldn't find a built-in method to do this.
I saw that cross_validation.KFold() has a shuffling flag, but it is not stratified. Unfortunately, cross_validation.StratifiedKFold() does not have such an option, and cross_validation.StratifiedShuffleSplit() does not produce disjoint folds.
Am I missing something? Is this planned?
(Obviously, I can implement this myself.)
The shuffling flag for cross_validation.StratifiedKFold has been introduced in the current version, 0.15: http://scikit-learn.org/0.15/modules/generated/sklearn.cross_validation.StratifiedKFold.html
This can be found in the Changelog:
http://scikit-learn.org/stable/whats_new.html#new-features
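For anyone who wants to see the flag in action, here is a minimal sketch. It assumes a newer scikit-learn, where the cross_validation module has been replaced by sklearn.model_selection and the splitter takes n_splits, with the labels passed to split(). The toy data and the random_state value are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up): 12 samples, two classes in a 2:1 ratio.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

# shuffle=True shuffles the samples within each class before the folds are built.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

for train_idx, test_idx in folds:
    # Each test fold keeps the 2:1 class ratio, and no sample appears in two test folds.
    print("test indices:", test_idx, "labels:", y[test_idx])
```

Changing random_state should give a different assignment of samples to folds, while the per-fold class ratio stays the same.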
As far as I know, this is actually implemented in scikit-learn as StratifiedShuffleSplit. From its docstring:
""" Stratified ShuffleSplit cross validation iterator
Provides train/test indices to split data in train test sets.
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets. """
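To make the docstring a bit more concrete, here is a rough pure-Python sketch of the idea: each iteration shuffles the indices of every class independently and cuts off a fixed fraction as the test set. This is not scikit-learn's code; the function name stratified_shuffle_split and its parameters (test_fraction, n_iter, seed) are hypothetical and chosen only for this illustration.

```python
import random
from collections import defaultdict

def stratified_shuffle_split(labels, test_fraction=0.25, n_iter=3, seed=0):
    """Yield (train, test) index lists; every split is drawn independently."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    for _ in range(n_iter):
        train, test = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Take the same fraction from every class to preserve class percentages.
            n_test = round(len(shuffled) * test_fraction)
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)

labels = [0] * 8 + [1] * 4
splits = list(stratified_shuffle_split(labels))
for train, test in splits:
    print("test indices:", test)
```

Because each iteration reshuffles from scratch, a sample can land in the test set of more than one split. That is why these splits are not disjoint folds, unlike the folds of a shuffled StratifiedKFold.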
I thought I would post my solution in case it is useful to anyone else.
Here is my implementation of a stratified shuffle split into training and test sets:
This code outputs: