Using sklearn, I want 3 splits (i.e. n_splits=3) of the sample dataset with a train/test ratio of 70:30. I'm able to split the set into 3 folds, but I can't define the test size (as I can with the train_test_split method). Is there a way to define the test sample size in StratifiedKFold?
from sklearn.model_selection import StratifiedKFold as SKF

skf = SKF(n_splits=3)
skf.get_n_splits(X, y)

# Loops over 3 iterations to get stratified train/test splits
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
StratifiedKFold performs, by definition, a K-fold split. That is, the iterator it returns yields K-1 folds for training and 1 fold for testing on each iteration. K is controlled by n_splits: the data is divided into K groups of roughly n_samples/K samples each, and every group is used exactly once as the test set while the remaining K-1 groups form the training set. See Wikipedia or search for "K-fold cross-validation" for more background.
In short, the size of the test set will be 1/K (i.e. 1/n_splits), so that parameter is the only way to control the test size (e.g. n_splits=3 gives a test split of 1/3 ≈ 33% of your data). An exact 30% test set would need K = 1/0.3 ≈ 3.33, which is not a valid number of folds. Also note that StratifiedKFold always iterates over all K folds, each time training on the other K-1, which might not be what you want.
Having said that, you might be interested in StratifiedShuffleSplit, which lets you configure both the number of splits and the train/test ratio. If you just want a single split, you can set n_splits=1 and still keep test_size=0.3 (or whatever ratio you want).
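A minimal sketch of that approach, assuming toy data (100 samples, 2 balanced classes) invented for illustration; random_state=0 is only there to make the run repeatable:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 samples, 2 balanced classes of 50 each
X = np.arange(100).reshape(100, 1)
y = np.repeat([0, 1], 50)

# 3 independent stratified splits, each holding out 30% for testing
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

split_sizes = []
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    split_sizes.append((len(train_index), len(test_index)))

print(split_sizes)
```

Keep in mind that, unlike the folds of StratifiedKFold, these splits are drawn independently, so the same sample can land in the test set of more than one split.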