Scikit - changing the threshold to create multiple

Posted 2020-06-16 02:33

Question:

I'm building a classifier that goes through Lending Club data and selects the best X loans. I've trained a Random Forest and created the usual ROC curves, confusion matrices, etc.

The confusion matrix takes as an argument the predictions of the classifier (the majority prediction of the trees in the forest). However, I wish to print multiple confusion matrices at different thresholds, to know what happens if I choose the 10% best loans, the 20% best loans, etc.

I know from reading other questions that changing the threshold is often a bad idea, but is there any other way to see confusion matrices for these situations? (question A)

If I go ahead with changing the threshold, should I assume that the best way to do so is to call predict_proba, threshold the result by hand, and pass that to the confusion matrix? (question B)

Answer 1:

A. In your case, changing the threshold is admissible and maybe even necessary. The default threshold is 50%, but from a business point of view even a 15% probability of non-repayment might be enough to reject an application.

In fact, in credit scoring it is common to set different cut-offs for different product terms or customer segments, after predicting probability of default with a common model (see e.g. chapter 9 of "Credit Risk Scorecards" by Naeem Siddiqi).
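As a rough illustration of segment-specific cut-offs on top of one common model, here is a minimal sketch. The segment labels and the cut-off values (0.15 and 0.30) are made up for the example, and class 1 of the synthetic data simply stands in for "default".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# One common model for all segments; class 1 stands in for "default".
model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
p_default = model.predict_proba(X_test)[:, 1]

# Hypothetical segment id per applicant (0 = new customer, 1 = existing).
rng = np.random.default_rng(0)
segment = rng.integers(0, 2, size=len(X_test))

# Hypothetical cut-offs, indexed by segment id: stricter for new customers.
cutoffs = np.array([0.15, 0.30])
cutoff_per_row = cutoffs[segment]

# Reject when the predicted default probability exceeds the segment's cut-off.
reject = p_default > cutoff_per_row
print(reject.sum(), "of", len(reject), "applications rejected")
```

The model is fitted once; only the comparison against the per-row cut-off differs between segments.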

B. There are two convenient ways to threshold at arbitrary alpha instead of 50%:

  1. Indeed, call predict_proba and threshold its output at alpha manually, or with a wrapper class (see the code below). Use this if you want to try multiple thresholds without refitting the model.
  2. Set class_weight to {0: alpha, 1: 1 - alpha} before fitting the model.
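Option 2 could look roughly like the sketch below. It assumes scikit-learn's `class_weight` parameter in its dict form, with class 0 weighted by alpha and class 1 by 1 - alpha. Note that for a random forest the weights change how the trees are grown, so this only approximates moving the threshold, and the model has to be refitted for every alpha.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

matrices = {}
for alpha in [0.3, 0.5, 0.7]:
    # Refit with class 0 weighted by alpha and class 1 by (1 - alpha).
    weighted_rf = RandomForestClassifier(
        class_weight={0: alpha, 1: 1 - alpha}, random_state=1
    ).fit(X_train, y_train)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    matrices[alpha] = confusion_matrix(
        y_test, weighted_rf.predict(X_test), labels=[0, 1]
    )
    print(f"alpha={alpha}")
    print(matrices[alpha])
```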

And now, some sample code for the wrapper:

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, split into train and test sets
X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

class CustomThreshold(BaseEstimator, ClassifierMixin):
    """ Custom threshold wrapper for binary classification"""
    def __init__(self, base, threshold=0.5):
        self.base = base
        self.threshold = threshold
    def fit(self, *args, **kwargs):
        self.base.fit(*args, **kwargs)
        return self
    def predict(self, X):
        return (self.base.predict_proba(X)[:, 1] > self.threshold).astype(int)

# Fit the forest once, then wrap the same fitted model with three thresholds
rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
clf = [CustomThreshold(rf, threshold) for threshold in [0.3, 0.5, 0.7]]

for model in clf:
    print(confusion_matrix(y_test, model.predict(X_test)))

# A threshold of 0.5 reproduces the forest's own predictions
assert (clf[1].predict(X_test) == clf[1].base.predict(X_test)).all()
# A lower threshold yields more positive predictions, a higher one fewer
assert sum(clf[0].predict(X_test)) > sum(clf[0].base.predict(X_test))
assert sum(clf[2].predict(X_test)) < sum(clf[2].base.predict(X_test))

It prints three confusion matrices, one per threshold (0.3, 0.5 and 0.7):

[[13  1]
 [ 2  9]]
[[14  0]
 [ 3  8]]
[[14  0]
 [ 4  7]]
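
The wrapper above uses thresholds fixed by hand. To get closer to the "best 10%, best 20%" framing in the question, one could instead derive each threshold from a quantile of the predicted probabilities. This is a minimal sketch, assuming class 1 plays the role of a "good" loan; the fractions are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
# Assumption: class 1 plays the role of a "good" loan.
p_good = rf.predict_proba(X_test)[:, 1]

top_k_matrices = {}
for fraction in [0.1, 0.2, 0.3]:
    # Threshold at the (1 - fraction) quantile, so roughly the top
    # `fraction` of the test set is selected.
    threshold = np.quantile(p_good, 1 - fraction)
    selected = (p_good >= threshold).astype(int)
    top_k_matrices[fraction] = confusion_matrix(y_test, selected, labels=[0, 1])
    print(f"top {fraction:.0%} (threshold {threshold:.2f})")
    print(top_k_matrices[fraction])
```

Because a forest's probabilities take only a limited number of distinct values, ties at the threshold can make the selected share somewhat larger than the requested fraction.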