I am trying to estimate the confusion matrix of a classifier using 10-fold cross-validation with sklearn. To compute the confusion matrix I am using sklearn.metrics.confusion_matrix. I know that I can evaluate a model with CV using sklearn.model_selection.cross_val_score and sklearn.metrics.make_scorer, like this:
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
cm = cross_val_score(clf, X, y, scoring=make_scorer(confusion_matrix), cv=10)
Here clf is my classifier and X, y are the feature and class vectors. However, this raises an error, since confusion_matrix returns a matrix rather than a single float.
I've tried doing something like:
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
def cv_confusion_matrix(clf, X, y, folds=10):
    skf = StratifiedKFold(n_splits=folds)
    cv_iter = skf.split(X, y)
    cms = []
    # Fit on each training split and evaluate on the held-out split
    for train, test in cv_iter:
        clf.fit(X[train], y[train])
        cm = confusion_matrix(y[test], clf.predict(X[test]), labels=clf.classes_)
        cms.append(cm)
    # Average over the fold axis (axis=0) to get a single mean matrix
    return np.mean(np.array(cms), axis=0)
This works, but I am missing the parallelism that sklearn provides through cross_val_score and its n_jobs parameter.
Is there any way to do this while still taking advantage of that parallelism?
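For reference, one possible direction is sketched below. It assumes that sklearn.model_selection.cross_val_predict, which accepts n_jobs, is an acceptable substitute for the manual loop. The function name cv_confusion_matrix_parallel is hypothetical, and labels are taken from np.unique(y) rather than clf.classes_ because the estimator passed in is never fitted in place. Since every sample is predicted exactly once, the matrix built from the pooled predictions is the sum of the per-fold matrices, and dividing by the number of folds gives their mean.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def cv_confusion_matrix_parallel(clf, X, y, folds=10, n_jobs=-1):
    """Mean per-fold confusion matrix, with the fold fits run in parallel."""
    skf = StratifiedKFold(n_splits=folds)
    # Each sample is predicted by the model trained on the folds that do
    # not contain it; n_jobs spreads those fits across CPU cores.
    y_pred = cross_val_predict(clf, X, y, cv=skf, n_jobs=n_jobs)
    labels = np.unique(y)
    # Pooled matrix = sum of per-fold matrices, so divide by the fold count.
    return confusion_matrix(y, y_pred, labels=labels) / folds
```

Dropping the division returns the summed confusion matrix over all folds instead, which may be the more natural quantity when the folds differ slightly in size.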