I get different values for different runs. What am I doing wrong here?
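import numpy as np
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation is the older module (removed in 0.20) that matches StratifiedKFold(y, ...)
from sklearn.cross_validation import StratifiedKFold, cross_val_score
X=np.random.random((100,5))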
y=np.random.randint(0,2,(100,))
clf=RandomForestClassifier()
cv = StratifiedKFold(y, random_state=1)
s = cross_val_score(clf, X,y,scoring='roc_auc', cv=cv)
print(s)
# [ 0.42321429 0.44360902 0.34398496]
s = cross_val_score(clf, X,y,scoring='roc_auc', cv=cv)
print(s)
# [ 0.42678571 0.46804511 0.36090226]
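The mistake is creating RandomForestClassifier with its default random_state=None. With None, the forest draws its randomness from NumPy's global generator (np.random), whose state advances on every call, so each fit builds different trees and the scores change.
To get identical arrays of cross-validation scores from repeated calls, fix random_state in RandomForestClassifier. The value does not have to match the one passed to StratifiedKFold: the folds are already deterministic unless shuffling is enabled, in which case StratifiedKFold needs a fixed random_state as well.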
Illustration:
X=np.random.random((100,5))
y=np.random.randint(0,2,(100,))
clf = RandomForestClassifier(random_state=1)
cv = StratifiedKFold(y, random_state=1) # Setting random_state is not necessary here
s = cross_val_score(clf, X,y,scoring='roc_auc', cv=cv)
print(s)
##[ 0.57612457 0.29044118 0.30514706]
s = cross_val_score(clf, X,y,scoring='roc_auc', cv=cv) # run it a second time to compare
print(s)
##[ 0.57612457 0.29044118 0.30514706]
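The snippet above uses the older StratifiedKFold(y, ...) signature. A rough equivalent for newer scikit-learn releases, assuming sklearn.model_selection is available (0.18 or later), might look like the sketch below. The values n_splits=3 and n_estimators=10 and the data seed are assumptions chosen to mirror the three scores and keep the run short; they are not taken from the original post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Local generator for the toy data (the seed 0 is an assumed value).
rng = np.random.RandomState(0)
X = rng.random_sample((100, 5))
y = rng.randint(0, 2, (100,))

# A fixed random_state makes every fit of the forest build the same trees.
clf = RandomForestClassifier(n_estimators=10, random_state=1)
# Without shuffle=True the folds are deterministic, so no random_state is passed.
cv = StratifiedKFold(n_splits=3)

s1 = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)
s2 = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)
print(s1)
print(s2)
print(np.array_equal(s1, s2))  # compares the two score arrays element by element
```

Because cross_val_score clones the estimator for each fold and every clone carries the same integer random_state, the two score arrays should come out identical.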
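Another option is to leave random_state unset for both RandomForestClassifier and StratifiedKFold and instead call np.random.seed(value) at the beginning of the script. This makes the whole script reproducible from one run to the next.
Keep in mind that the global state still advances between calls, so two cross_val_score calls within the same run only return equal arrays if np.random.seed(value) is called again before each of them.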
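As a minimal sketch of the global-seed approach, the example below reseeds NumPy's global generator before each cross-validation call. It assumes a newer scikit-learn with sklearn.model_selection and the default single-process execution; the helper name seeded_scores, the seed values, n_splits=3 and n_estimators=10 are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

np.random.seed(0)  # seed for the toy data (assumed value)
X = np.random.random((100, 5))
y = np.random.randint(0, 2, (100,))

# random_state is left as None, so the forest uses NumPy's global generator.
clf = RandomForestClassifier(n_estimators=10)
cv = StratifiedKFold(n_splits=3)


def seeded_scores(seed):
    """Hypothetical helper: reseed the global generator, then cross-validate."""
    np.random.seed(seed)
    return cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)


a = seeded_scores(42)
b = seeded_scores(42)
print(a)
print(b)
print(np.array_equal(a, b))  # compares the two score arrays element by element
```

Reseeding inside the helper is the design choice that matters here: seeding only once at the top of the script would make the script repeatable as a whole, but the second call would start from an already advanced generator state. Passing an explicit random_state to the estimator is usually the more robust option, since it does not depend on what other code has drawn from the global generator.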