class_weight hyperparameter in Random Forest changes the amount of samples in the confusion matrix

Posted 2019-06-11 10:13

I'm currently working on a Random Forest classification model with 24,000 samples, 20,000 of which belong to class 0 and 4,000 to class 1. I made a train_test_split where the test set is 0.2 of the whole dataset (around 4,800 samples). Since I'm dealing with imbalanced data, I looked at the class_weight hyperparameter, which is meant to address this issue.
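For context, here is a simplified, self-contained sketch of what I'm doing (the make_classification call just stands in for my real data; everything else uses default hyperparameters):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for my data: 24,000 samples, ~4,000 of class 1
X, y = make_classification(n_samples=24000, weights=[20000/24000],
                           random_state=42)

# 80/20 split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)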

The problem I'm facing: the moment I set class_weight='balanced' and look at the confusion_matrix of the training set, I get something like this:

array([[13209,   747],
       [ 2776,  2468]])

As you can see, the lower row corresponds to False Negative = 2776 followed by True Positive = 2468, while the upper row corresponds to True Negative = 13209 followed by False Positive = 747.

The problem is that, according to the confusion_matrix, the number of samples belonging to class 1 is 2,776 (False Negative) + 2,468 (True Positive), which sums to 5,244. That doesn't make any sense, since the whole dataset contains only 4,000 samples of class 1, of which only 3,200 are in the train set. It looks like confusion_matrix returns a transposed version of the matrix, because the actual number of class-1 samples should sum to 3,200 in the train set and 800 in the test set. The right numbers should be 747 + 2,468, which sums to 3,215, roughly the right number of class-1 samples in the train set.

Can someone explain to me what happens the moment I use class_weight? Is it true that confusion_matrix returns a transposed version of the matrix? Am I looking at it the wrong way? I have tried looking for an answer and visited several questions which are somewhat similar, but none of them really covered this issue.
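To double-check the arithmetic, this is the kind of sanity check I run (continuing the sketch above):

import numpy as np
from sklearn.metrics import confusion_matrix

print(np.bincount(y_train))    # actual per-class counts in the train set

y_pred_train = clf.predict(X_train)
cm = confusion_matrix(y_train, y_pred_train)
print(cm)
print(cm.sum(axis=1))          # row sums should equal the counts above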

Those are some of the sources I looked at:

scikit-learn: Random forest class_weight and sample_weight parameters

How to tune parameters in Random Forest, using Scikit Learn?

https://datascience.stackexchange.com/questions/11564/how-does-class-weights-work-in-randomforestclassifier

https://stats.stackexchange.com/questions/244630/difference-between-sample-weight-and-class-weight-randomforest-classifier

using sample_weight and class_weight in imbalanced dataset with RandomForest Classifier

Any help would be appreciated, thanks.

1 Answer
我命由我不由天
Answered 2019-06-11 10:15

Reproducing the toy example from the docs:

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# ravel() flattens the 2x2 matrix row by row: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
(tn, fp, fn, tp)
# (0, 2, 1, 1)

So, the reading of the confusion matrix you have provided seems to be correct.
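In fact, unpacking the matrix from your post in exactly the same way makes the inconsistency you noticed explicit:

import numpy as np

cm = np.array([[13209,   747],
               [ 2776,  2468]])

tn, fp, fn, tp = cm.ravel()
print(fn + tp)   # 5244 "actual" class-1 samples -- more than the ~3,200
                 # present in your train set, so something is off upstream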

Is it true that the confusion_matrix returns a transposed version of the matrix?

As the above example shows, no. But a very easy (and innocent-looking) mistake is to interchange the order of the y_true and y_pred arguments, which does matter; the result would indeed be a transposed matrix:

# correct order of arguments:
confusion_matrix(y_true, y_pred)
# array([[0, 2],
#        [1, 1]])

# inverted (wrong) order of the arguments:
confusion_matrix(y_pred, y_true)
# array([[0, 1],
#        [2, 1]])
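
A simple way to make this mistake impossible is to pass the arguments by keyword; the first two parameters of confusion_matrix are literally named y_true and y_pred:

# keyword arguments cannot be swapped silently:
confusion_matrix(y_true=y_true, y_pred=y_pred)
# array([[0, 2],
#        [1, 1]])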

From the information you have provided it is impossible to say whether this is the reason, which is a good reminder of why you should always post your actual code rather than a verbal description of what you think your code is doing...
