Is there any built-in way to get scikit-learn to perform shuffled stratified k-fold cross-validation? This is one of the most common CV methods, and I am surprised I couldn't find a built-in method to do this.

I saw that cross_validation.KFold() has a shuffle flag, but it is not stratified. Unfortunately, cross_validation.StratifiedKFold() has no such option, and cross_validation.StratifiedShuffleSplit() does not produce disjoint folds.

Am I missing something? Is this planned?

(Obviously, I can implement this myself.)

The shuffle flag for cross_validation.StratifiedKFold was introduced in version 0.15, the current release. From the changelog:

Shuffle option for cross_validation.StratifiedKFold. By Jeffrey Blackburne.
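
To make concrete what a shuffled stratified k-fold split does, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the helper name shuffled_stratified_kfold and its seed parameter are made up for illustration. It shuffles the indices within each class and then deals them round-robin into disjoint folds.

```python
import random
from collections import defaultdict


def shuffled_stratified_kfold(y, n_folds, seed=None):
    """Yield (train, test) index lists for shuffled stratified k-fold.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    # Shuffle within each class, then deal the indices round-robin into folds.
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    # Each fold serves as the test set exactly once.
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


y = [0] * 10 + [1] * 20 + [3] * 10
splits = list(shuffled_stratified_kfold(y, n_folds=5, seed=0))
for train, test in splits:
    print(len(train), len(test))
```

The test folds are disjoint and together cover every sample once, which is the property StratifiedShuffleSplit lacks. In later scikit-learn releases the same option is available as StratifiedKFold(n_splits=..., shuffle=True) in sklearn.model_selection.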

As far as I know, this is already implemented in scikit-learn as StratifiedShuffleSplit. Its docstring reads:

""" Stratified ShuffleSplit cross validation iterator

Provides train/test indices to split data in train test sets.

This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets. """
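
The note in that docstring is the reason these splits are not k-fold: each iteration is drawn independently, so test sets from different iterations can share samples. A minimal pure-Python sketch of the idea follows. It is not scikit-learn's code; the helper name stratified_shuffle_split and its parameters are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(y, n_iter, test_size, seed=None):
    """Yield n_iter independent stratified (train, test) index lists.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    for _ in range(n_iter):
        train, test = [], []
        # Draw a fresh random test subset from every class on each iteration.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_test = int(round(test_size * len(shuffled)))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)


y = [0] * 10 + [1] * 20 + [3] * 10
ss_splits = list(stratified_shuffle_split(y, n_iter=5, test_size=0.5, seed=0))
seen = [i for _, test in ss_splits for i in test]
# Five test sets of 20 drawn from 40 samples cannot all be disjoint.
has_overlap = len(seen) != len(set(seen))
print(has_overlap)
```

Every individual split keeps the class proportions, but samples repeat across test sets, so the iterations do not partition the data the way k-fold does.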

I thought I would post my solution in case it is useful to anyone else.

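from collections import defaultdict
import random

import numpy as np

def strat_map(y):
    """Returns permuted indices that maintain class"""
    # Group the indices of y by class label.
    smap = defaultdict(list)
    for i, v in enumerate(y):
        smap[v].append(i)
    # Shuffle the indices within each class.
    for values in smap.values():
        random.shuffle(values)
    # Map each position to a random index of the same class.
    y_map = np.zeros(len(y), dtype=int)
    for i, v in enumerate(y):
        y_map[i] = smap[v].pop()
    return y_map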

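# Example use (pre-0.18 API: StratifiedKFold takes y and yields (train, test) index arrays)
from sklearn.cross_validation import StratifiedKFold

skf = StratifiedKFold(y, nfolds)
sm = strat_map(y)
for train, test in skf:
    train, test = sm[train], sm[test]
    # then cv as usual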

import numpy as np

# Check that strat_map shuffles indices while keeping every class in place.
for _ in range(100):
    y = np.array([0]*10 + [1]*20 + [3]*10)
    sm = strat_map(y)
    shuffled = y[sm]
    assert (sm != np.arange(len(y))).any(), "did not shuffle"
    assert (shuffled == y).all(), "classes not in right position"
    assert set(sm) == set(range(len(y))), "missing indices"

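# Check the same properties on the folds (StratifiedKFold yields train first, then test).
for _ in range(100):
    nfolds = 10
    skf = StratifiedKFold(y, nfolds)
    sm = strat_map(y)
    for train, test in skf:
        # The shuffle check uses the larger training fold so that it cannot fail by chance.
        assert (sm[train] != train).any(), "did not shuffle"
        assert (y[sm[test]] == y[test]).all(), "classes not in right position"

Here is my implementation of a stratified shuffle split into training and testing sets: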

import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and
    test sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        # Shuffle the indices of this class, then cut them at the train proportion.
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True

    return train_inds, test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]
