I am trying to split a dataset for cross-validation and GridSearch in sklearn. I want to define my own split, but GridSearch only seems to take the built-in cross-validation methods.
However, I can't use the built-in cross-validation methods because I need certain groups of examples to be in the same fold. So, if I have examples: [A1, A2, A3, A4, A5, B1, B2, B3, C1, C2, C3, C4, ..., Z1, Z2, Z3]
I want to perform cross-validation such that the examples from each group [A, B, C, ...] exist in only one fold.
I.e., K1 contains [D, E, G, J, K, ...], K2 contains [A, C, L, M, ...], K3 contains [B, F, I, ...], etc.
I know this question is quite old, but I had the same problem. Looks like there will soon be a contribution that lets you do this:
https://github.com/scikit-learn/scikit-learn/pull/4583
This type of thing can usually be done with `sklearn.cross_validation.LeaveOneLabelOut`. You just need to construct a label vector that encodes your groups. I.e., all samples in `K1` would take label `1`, all samples in `K2` would take label `2`, and so on. Here is a fully runnable example with fake data. The important lines are the one creating the `cv` object, and the call to `cross_val_score`.
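A minimal sketch of that idea, with some assumptions on my side: in current scikit-learn releases the `sklearn.cross_validation` module is gone, and `LeaveOneLabelOut` lives on as `sklearn.model_selection.LeaveOneGroupOut`, which takes the label vector through a `groups` argument. The fake data, the three fold labels and the choice of `LogisticRegression` are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)

# Fake data (illustrative only): 30 samples, 5 features, alternating binary target
# so that every training split contains both classes.
X = rng.randn(30, 5)
y = np.tile([0, 1], 15)

# Label vector encoding the folds: samples in K1 get 1, K2 get 2, K3 get 3.
groups = np.repeat([1, 2, 3], 10)

# Important line 1: the cv object. Each distinct label becomes one test fold.
cv = LeaveOneGroupOut()

# Important line 2: pass the label vector along with the cv object.
scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)
print(scores)  # one score per fold
```

The same `cv` object should also work for a grid search: give it to `GridSearchCV(..., cv=cv)` and pass `groups=groups` to `fit`.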
However, it is of course possible that you run into a situation where you would like to define your folds by hand completely. In this case you would need to create an iterable (e.g. a `list`) of pairs `(train, test)` indicating via indices which samples to take into your train and test sets of each fold. Let's check this:
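Again only a sketch under assumptions: it uses `sklearn.model_selection.GridSearchCV` from a current scikit-learn, and the group letters, the fold assignment in `fold_groups` and the parameter grid are hypothetical values picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)

# Fake data (illustrative only): 12 samples, 3 features, alternating binary target.
X = rng.randn(12, 3)
y = np.tile([0, 1], 6)

# Hypothetical group letter for each sample, e.g. A1, A2, A3, B1, ...
sample_groups = np.array(list("AAABBBCCCDDD"))

# Hand-picked assignment of whole groups to test folds (an assumption for this demo).
fold_groups = [["A", "D"], ["B"], ["C"]]

# Build the list of (train, test) index arrays, one pair per fold.
indices = np.arange(len(X))
custom_cv = []
for held_out in fold_groups:
    test_mask = np.isin(sample_groups, held_out)
    custom_cv.append((indices[~test_mask], indices[test_mask]))

# The list of index pairs is passed directly as the cv argument.
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=custom_cv)
search.fit(X, y)
print(search.best_params_)
```

Because each group's samples only ever appear in one test fold, no group is split between train and test within a fold.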