Does sklearn support a cost matrix?

Posted 2019-06-16 05:56

Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different kinds of mistakes? For example, in a 2-class problem the cost matrix would be a 2 by 2 square matrix with A_ij = the cost of classifying class i as class j.

The main classifier I am using is a Random Forest.

Thanks.

5 Answers
劫难
#2 · 2019-06-16 06:07

You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix, so choosing a classifier threshold (and therefore a particular confusion matrix) implicitly imposes a cost weighting scheme. You just have to choose the threshold whose confusion matrix reflects the cost matrix you are looking for.

On the other hand if you really had your heart set on it, and really want to "train" an algorithm using a cost matrix, you could "sort of" do it in sklearn.

Although you cannot directly train an algorithm to be cost sensitive in sklearn, you could use a cost-matrix-style setup to tune your hyperparameters. I've done something similar to this using a genetic algorithm. It really doesn't do a great job, but it should give a modest boost to performance.
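For instance, here is a minimal sketch of the threshold-tuning idea above; the cost matrix values and the synthetic data are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical cost matrix: cost[i, j] = cost of predicting j when the truth is i
cost = np.array([[0.0, 1.0],   # true 0: a false positive costs 1
                 [5.0, 0.0]])  # true 1: a false negative costs 5

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

def total_cost(y_true, y_pred, cost):
    # Sum cost[true, predicted] over all validation instances
    return cost[y_true, y_pred].sum()

# Sweep thresholds and keep the one with the lowest total cost
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(y_val, (proba >= t).astype(int), cost) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
```

Picking `best` as the decision threshold then gives you the cost-sensitive behavior at prediction time, without retraining the forest.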

Lonely孤独者°
#3 · 2019-06-16 06:11

The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.

虎瘦雄心在
#4 · 2019-06-16 06:28

You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:

import pandas as pd

def financial_loss_scorer(y, y_pred, **kwargs):
    totals = kwargs['totals']

    # Create an indicator - 0 if correct, 1 otherwise
    errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))
    # Join the per-instance cost data on the index
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')
    # Cost is incurred only on misclassified instances
    loss = results.Result * results.SumNetAmount

    return loss.sum()

The scorer becomes:

make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)

Where totals_data is a pandas.DataFrame whose index matches the training set's index.
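To show how such a scorer might plug into cross-validation, here is a self-contained sketch; the SumNetAmount column, the synthetic data, and totals_data are made-up stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def financial_loss_scorer(y, y_pred, **kwargs):
    totals = kwargs['totals']
    # 0 if correct, 1 otherwise, keeping the fold's index
    errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))
    # Join per-instance costs on the index; only fold rows survive the inner join
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')
    return (results.Result * results.SumNetAmount).sum()

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(200, 4))
y = pd.Series((X.sum(axis=1) > 0).astype(int))
# Hypothetical per-instance cost column, indexed like the training data
totals_data = pd.DataFrame({'SumNetAmount': rng.uniform(10, 100, 200)},
                           index=X.index)

scorer = make_scorer(financial_loss_scorer, totals=totals_data,
                     greater_is_better=False)
scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                         X, y, scoring=scorer, cv=3)
```

With greater_is_better=False the reported scores are negated losses, so "higher" (closer to zero) is better during tuning.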

贼婆χ
#5 · 2019-06-16 06:29

This may not directly answer your question (since you are asking about Random Forest), but for SVM in sklearn you can use the class_weight parameter to specify the weights of different classes. Essentially, you pass in a dictionary mapping class labels to weights.

You might want to refer to this page to see an example of using class_weight.
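As a quick sketch of class_weight with an SVC (the weight values and synthetic data are arbitrary; note that sklearn's RandomForestClassifier accepts the same parameter):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Make mistakes on class 1 ten times as costly as mistakes on class 0
clf = SVC(class_weight={0: 1, 1: 10}).fit(X, y)

# 'balanced' weighs classes inversely proportional to their frequencies
clf_balanced = SVC(class_weight='balanced').fit(X, y)

pred = clf.predict(X)
```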

\"骚年 ilove
6楼-- · 2019-06-16 06:31

One way to work around this limitation is to use under- or oversampling. E.g., if you are doing binary classification on an imbalanced dataset and want to make errors on the minority class more costly, you could oversample it. You may want to have a look at imbalanced-learn, which is a package from scikit-learn-contrib.
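As a sketch, naive random oversampling of the minority class can be done with plain numpy (imbalanced-learn's RandomOverSampler does this, and the package also offers smarter variants such as SMOTE); the data here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Resample minority-class rows with replacement until the classes are balanced
rng = np.random.RandomState(0)
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])

X_res, y_res = X[idx], y[idx]
clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
```

Duplicating minority rows makes each minority-class error count proportionally more during training, which approximates a higher misclassification cost for that class.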
