I'm currently working on a Random Forest classification model with 24,000 samples, 20,000 of which belong to class 0 and 4,000 to class 1. I made a train_test_split where the test_set is 0.2 of the whole dataset (around 4,800 samples in the test_set). Since I'm dealing with imbalanced data, I looked at the hyperparameter class_weight, which is aimed at solving this issue.
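For context, this is roughly the shape of my setup (a simplified sketch with synthetic stand-in data, not my actual script):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for my real data: roughly 20,000 samples of class 0 and 4,000 of class 1
X, y = make_classification(n_samples=24_000, weights=[20_000 / 24_000], random_state=42)

# 80/20 split, i.e. around 19,200 samples in the train_set and 4,800 in the test_set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)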
The problem I'm facing is that the moment I set class_weight='balanced' and look at the confusion_matrix of the training set, I get something like this:
array([[13209,   747],
       [ 2776,  2468]])
As you can see, the lower row corresponds to False Negative = 2776 followed by True Positive = 2468, while the upper row corresponds to True Negative = 13209 followed by False Positive = 747. The problem is that, according to the confusion_matrix, the number of samples belonging to class 1 is 2,776 (False Negative) + 2,468 (True Positive), which sums to 5,244 samples of class 1. This doesn't make any sense, since the whole dataset contains only 4,000 samples belonging to class 1, of which only roughly 3,200 are in the train_set.
It looks like confusion_matrix returns a transposed version of the matrix, because the actual number of samples belonging to class 1 in the train_set should sum to about 3,200, with 800 in the test_set. In other words, the right numbers should be 747 + 2468, which sums to 3,215, the correct number of class 1 samples in the train_set.
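To double-check that arithmetic (a small sketch on the exact matrix above):

import numpy as np

cm = np.array([[13209, 747],
               [2776, 2468]])

# row sums: how many true samples of each class the matrix implies
print(cm.sum(axis=1))  # [13956  5244] -> 5,244 class 1 samples, which is too many

# column sums: 747 + 2468 = 3,215, which matches my actual class 1 count in the train_set
print(cm.sum(axis=0))  # [15985  3215]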
Can someone explain to me what happens the moment I use class_weight? Is it true that confusion_matrix returns a transposed version of the matrix? Am I looking at it the wrong way?
I have tried looking for an answer and visited several questions that are somewhat similar, but none of them really covered this issue.
These are some of the sources I looked at:
scikit-learn: Random forest class_weight and sample_weight parameters
How to tune parameters in Random Forest, using Scikit Learn?
using sample_weight and class_weight in imbalanced dataset with RandomForest Classifier
Any help would be appreciated, thanks.
Reproducing the toy example from the docs:
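This is the binary example from the scikit-learn confusion_matrix documentation, written out as a minimal runnable snippet:

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[0 2]
#  [1 1]]

# Rows are the true labels and columns the predicted ones,
# so the layout is [[TN, FP], [FN, TP]]:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 0 2 1 1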
So, the reading of the confusion matrix you have provided seems to be correct: rows correspond to the true labels and columns to the predicted ones. Note also that class_weight has nothing to do with these counts; it only changes how the samples are weighted during model fitting, not how many samples of each class are actually present in your training set.
As the above example showed, no, confusion_matrix does not return a transposed version of the matrix by itself. But a very easy (and innocent-looking) mistake is to interchange the order of the y_true and y_pred arguments, which does matter; the result would indeed be a transposed matrix.
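For instance, continuing with the same toy data (a minimal sketch; with your own y_train and training-set predictions the effect would be the same):

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# correct argument order: confusion_matrix(y_true, y_pred)
print(confusion_matrix(y_true, y_pred))
# [[0 2]
#  [1 1]]

# arguments accidentally interchanged: the result is the transpose
print(confusion_matrix(y_pred, y_true))
# [[0 1]
#  [2 1]]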
It is impossible to say from the info you have provided whether this is what happened in your case, which is a good reminder of why you should always post your actual code rather than a verbal description of what you think your code is doing...