Sklearn confusion matrix estimation by cross valid

Posted 2019-09-15 22:40

Question:

I am trying to estimate the confusion matrix of a classifier using 10-fold cross-validation with sklearn.

To compute the confusion matrix I am using sklearn.metrics.confusion_matrix. I know that I can evaluate a model with cross-validation using sklearn.model_selection.cross_val_score and sklearn.metrics.make_scorer, like this:

from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
cm = cross_val_score(clf, X, y, scoring=make_scorer(confusion_matrix))

Here clf is my classifier, and X and y are the feature matrix and class vector. However, this will raise an error, since confusion_matrix returns a matrix rather than a single float.

I've tried doing something like:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold


def cv_confusion_matrix(clf, X, y, folds=10):
    skf = StratifiedKFold(n_splits=folds)
    cv_iter = skf.split(X, y)
    cms = []

    for train, test in cv_iter:
        clf.fit(X[train], y[train])
        cm = confusion_matrix(y[test], clf.predict(X[test]), labels=clf.classes_)
        cms.append(cm)
    # Average element-wise across folds (axis 0), not across rows within each matrix.
    return np.mean(np.array(cms), axis=0)

This works, but I am missing the parallelism that sklearn offers through cross_val_score and its n_jobs parameter.

Is there any way to do this while taking advantage of that parallelism?
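
One possible direction, offered as a sketch rather than a definitive answer: sklearn.model_selection.cross_val_predict accepts n_jobs and returns one out-of-fold prediction per sample, so a single confusion matrix over those predictions is the sum of the per-fold matrices, and dividing by the number of folds gives their element-wise mean. The function name cv_confusion_matrix_parallel and the use of np.unique(y) for the label order are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def cv_confusion_matrix_parallel(clf, X, y, folds=10, n_jobs=-1):
    """Mean per-fold confusion matrix built from out-of-fold predictions.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    skf = StratifiedKFold(n_splits=folds)
    # Each sample is predicted exactly once, by a model that was not
    # trained on it; n_jobs spreads the per-fold fits across workers.
    y_pred = cross_val_predict(clf, X, y, cv=skf, n_jobs=n_jobs)
    # Fix the label order so rows and columns are well defined.
    labels = np.unique(y)
    # This matrix is the sum of the confusion matrices of all folds.
    total = confusion_matrix(y, y_pred, labels=labels)
    # Sum divided by the number of folds equals the element-wise mean.
    return total / folds
```

Calling cv_confusion_matrix_parallel(clf, X, y) with the same clf, X, and y as above should give the same kind of averaged matrix as the manual loop, with the per-fold fits dispatched in parallel. Passing n_jobs=1 falls back to sequential execution.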