Does sklearn support a cost matrix?

Posted 2019-06-16 05:56

Question:

Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different mistakes? For example, in a 2-class problem the cost matrix would be a 2-by-2 square matrix, where A_ij is the cost of classifying i as j.

The main classifier I am using is a Random Forest.

Thanks.

Answer 1:

The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.



Answer 2:

One way to circumvent this limitation is to use under- or oversampling. For example, if you are doing binary classification with an imbalanced dataset and want errors on the minority class to be more costly, you could oversample that class. You may want to have a look at imbalanced-learn, a package from scikit-learn-contrib.
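As a rough illustration of the oversampling idea, here is a minimal sketch using `sklearn.utils.resample`. The toy data and the choice to balance the classes exactly are assumptions made for the example; imbalanced-learn provides more complete samplers.

```python
import numpy as np
from sklearn.utils import resample

# Toy data (made up for illustration): 8 majority rows, 2 minority rows.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Draw minority rows with replacement until they match the majority count.
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)

# Recombine into a balanced training set.
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
```

Oversampling the minority class by a larger or smaller factor changes how heavily its errors weigh during training, so the factor acts as a crude stand-in for a cost ratio.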



Answer 3:

You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix, so choosing a classifier threshold (and with it a confusion matrix) implies some sort of cost weighting scheme. You then just have to choose the threshold whose confusion matrix matches the cost matrix you are looking for.
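A minimal sketch of that threshold choice, assuming you already have predicted probabilities on a validation set. The cost matrix, labels, and scores below are invented for illustration; `cost[i, j]` follows the question's convention (cost of classifying i as j).

```python
import numpy as np

# Hypothetical cost matrix: cost[i, j] = cost of classifying true class i as j.
cost = np.array([[0.0, 1.0],    # true 0: correct, false positive
                 [5.0, 0.0]])   # true 1: false negative, correct

# Made-up validation labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])


def total_cost(threshold):
    """Sum the cost of every prediction made at this threshold."""
    y_pred = (scores >= threshold).astype(int)
    return cost[y_true, y_pred].sum()


# Each candidate threshold corresponds to one point on the ROC curve.
thresholds = np.unique(scores)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]
```

The threshold should be picked on held-out data rather than the training set, otherwise the estimated cost will be optimistic.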

On the other hand, if you really have your heart set on it and want to "train" an algorithm using a cost matrix, you can "sort of" do it in sklearn.

Although it is impossible to directly train an algorithm to be cost-sensitive in sklearn, you could use a cost-matrix setup to tune your hyper-parameters. I've done something similar using a genetic algorithm. It doesn't do a great job, but it should give a modest boost to performance.
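One way this tuning could look is sketched below. It swaps the genetic algorithm for a plain `GridSearchCV`, and the cost matrix, synthetic data, and parameter grid are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV

# Hypothetical cost matrix: COST[i, j] = cost of classifying true class i as j.
COST = np.array([[0.0, 1.0],
                 [5.0, 0.0]])


def total_cost(y_true, y_pred):
    """Weight each confusion-matrix cell by its cost and sum."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return float((cm * COST).sum())


# greater_is_better=False makes the search minimise the cost.
cost_scorer = make_scorer(total_cost, greater_is_better=False)

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 5, None], "min_samples_leaf": [1, 5]},
    scoring=cost_scorer,
    cv=3,
)
search.fit(X, y)
```

Because the scorer is negated, `search.best_score_` is the negative of the lowest mean cost found, and `search.best_params_` holds the corresponding settings. The trees themselves are still grown with the usual impurity criterion; only model selection sees the costs.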



Answer 4:

This may not directly answer your question (since you are asking about Random Forest), but for SVM in sklearn you can use the class_weight parameter to specify weights for the different classes. Essentially, you pass in a dictionary that maps each class label to its weight.

You might want to refer to this page to see an example of using class_weight.
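For reference, a minimal sketch of passing such a dictionary. `RandomForestClassifier` also accepts a `class_weight` argument, so the same idea carries over to the asker's model. The 1:5 weights and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Hypothetical weights: samples of class 1 count five times as much.
weights = {0: 1, 1: 5}

svm = SVC(class_weight=weights).fit(X, y)
forest = RandomForestClassifier(
    n_estimators=50, class_weight=weights, random_state=0
).fit(X, y)
```

Note that these are per-class weights, not per-(true, predicted) costs, so they can mimic a 2-by-2 cost matrix with zero cost for correct predictions but cannot express an arbitrary multi-class cost matrix.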



Answer 5:

You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:

import pandas as pd


def financial_loss_scorer(y, y_pred, **kwargs):
    """Return the total amount lost on misclassified rows (y must be a pandas Series)."""
    # DataFrame of per-instance amounts, indexed like y, with a SumNetAmount column
    totals = kwargs['totals']

    # Create an indicator - 0 if correct, 1 otherwise
    errors = (y != y_pred).astype(int).rename('Result').to_frame()
    # Use the product totals dataset to create results (joined on the index)
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')
    # Calculate per-prediction loss
    loss = results.Result * results.SumNetAmount

    return loss.sum()

The scorer becomes:

from sklearn.metrics import make_scorer

make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)

Here totals_data is a pandas.DataFrame with a SumNetAmount column and an index that matches the training set index.