I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold
method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.
In addition to the accepted answer by @Andreas Mueller, just want to add that as @tangy mentioned above:
StratifiedShuffleSplit most closely resembles train_test_split(stratify = y) with added features of:
[update for 0.17]
See the docs of sklearn.model_selection.train_test_split.
[/update for 0.17]
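For reference, a minimal sketch of a stratified 75%/25% split with train_test_split. It assumes a scikit-learn release that ships sklearn.model_selection and the stratify parameter; the toy arrays X and y are made up for illustration and are not the asker's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples with an 80/20 class imbalance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class proportions of y in both parts;
# test_size=0.25 gives the 75%/25% split asked for in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(np.bincount(y_train))  # class counts in the training part
print(np.bincount(y_test))   # class counts in the test part
```

Passing random_state is optional; it only makes the split reproducible.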
There is a pull request here. But you can simply do
train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want.

Here's an example for continuous/regression data (until this issue on GitHub is resolved).
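One way a continuous target might be handled, as a rough sketch rather than the exact code from this answer: bin the target into a few groups and stratify on the bin labels. The quartile binning, the number of bins, and the synthetic data below are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical): 200 samples, skewed target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = rng.exponential(size=200)

# Interior quartile edges split y into 4 bins of roughly equal size.
edges = np.percentile(y, [25, 50, 75])
y_binned = np.digitize(y, edges)  # bin label 0..3 for every sample

# Stratify on the bin labels, but keep the continuous y as the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y_binned, random_state=0
)

# Each bin should be represented about equally in the test part.
print(np.bincount(np.digitize(y_test, edges)))
```

Fewer, wider bins are safer on small datasets, since every bin needs enough samples to appear in both parts.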
TL;DR: Use StratifiedShuffleSplit with test_size=0.25.
Scikit-learn provides two modules for stratified splitting:
StratifiedKFold: sets up n_folds training/testing sets such that classes are equally balanced in both.
Here's some code (directly from the above documentation):
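For newer scikit-learn releases, a minimal sketch of the same idea might look like the following. It assumes sklearn.model_selection, where the n_folds parameter is named n_splits, and it uses made-up toy arrays. Taking only the first of four folds yields a 75%/25% split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 100 samples with an 80/20 class imbalance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Four folds: each test fold holds 1/4 of the data, i.e. 25%.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first (train, test) pair of index arrays.
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(np.bincount(y_train), np.bincount(y_test))
```

This only works for test fractions of the form 1/k, which is the limitation the question runs into.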
StratifiedShuffleSplit: use n_iter=1 to get a single split. You can specify the test size here, the same as in train_test_split.
Code:
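A minimal sketch of a single stratified shuffle split, assuming the sklearn.model_selection API, where the number of splits is passed as n_splits rather than n_iter. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 100 samples with an 80/20 class imbalance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# One split only, with 25% of the samples held out for testing.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

# split() yields (train, test) index arrays; take the single pair.
train_idx, test_idx = next(sss.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(np.bincount(y_train), np.bincount(y_test))
```

Unlike the k-fold approach, test_size can be any fraction here.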