I am trying to use train_test_split from the scikit-learn package, but I am having trouble with the stratify parameter. Here is the code:
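from sklearn import cross_validation, datasets
iris = datasets.load_iris()  # load the dataset (this line was missing from the snippet)
X = iris.data[:, :2]         # keep only the first two features
y = iris.target
cross_validation.train_test_split(X, y, stratify=y)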
However, I keep getting the following error:
raise TypeError("Invalid parameters passed: %s" % str(options))
TypeError: Invalid parameters passed: {'stratify': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}
Does anyone have an idea what is going on? Below is the function documentation.
[...]
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the labels array.
New in version 0.17: stratify splitting
[...]
For my future self who comes here via Google: train_test_split is now in model_selection, so it should be imported from there (from sklearn.model_selection import train_test_split). Setting random_state is desirable for reproducibility.

Try running this code, it "just works":
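A minimal sketch of such a call, assuming a scikit-learn release that ships sklearn.model_selection (0.18 or later). The test_size and random_state values are illustrative choices, not taken from the posts above:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Same data as in the question: first two iris features and the class labels.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# stratify=y keeps the class proportions of y in both subsets;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Count how many samples of each class ended up in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "test:", test_counts)
```

Because iris has 50 samples per class, a stratified 80/20 split should put the same number of samples from each class into the training set, and likewise for the test set.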
In this context, stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset.
Scikit-Learn is just telling you it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17 as indicated in the documentation you quoted.
So you just need to update Scikit-Learn.
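To confirm whether the installed version is the cause, one option is to compare sklearn.__version__ against 0.17. The sketch below assumes a version string that starts with "major.minor"; version_tuple is a hypothetical helper written for this example, not part of scikit-learn:

```python
import re
import sklearn


def version_tuple(version):
    """Return (major, minor) as ints from a string such as '0.17.1' or '0.17b1'.

    Hypothetical helper for illustration; it only reads the leading
    'major.minor' digits and ignores any suffix.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


installed = version_tuple(sklearn.__version__)
# The stratify parameter was added in 0.17, per the documentation quoted above.
supports_stratify = installed >= (0, 17)
print(sklearn.__version__, "supports stratify:", supports_stratify)
```

If this prints False, upgrading (for example with pip install --upgrade scikit-learn) should make the stratify argument available.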
The stratify parameter makes a split so that the proportion of values in the sample produced will be the same as the proportion of values provided to the stratify parameter.

For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% zeros and 75% ones, stratify=y will make sure that your random split has 25% 0's and 75% 1's.
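The 25%/75% case can be checked directly. This is a small sketch on synthetic data; the array sizes, test_size, and random_state are assumptions chosen so that the subsets divide evenly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary labels: 25 zeros and 75 ones (25% / 75%).
y = np.array([0] * 25 + [1] * 75)
# Dummy single-feature matrix with one row per label.
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of zeros in each subset; with stratify=y both match the input (0.25).
train_zero_fraction = np.mean(y_train == 0)
test_zero_fraction = np.mean(y_test == 0)
print(train_zero_fraction, test_zero_fraction)
```

Without stratify=y the split is purely random, so the fractions in the two subsets may drift away from 25%, especially for small test sets.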