Unbalanced labels - Better results in Confusion Matrix

Published 2019-09-14 16:54

Question:

I have unbalanced labels. That is, in a binary classifier, I have more positive (1) examples and fewer negative (0) examples. I'm using Stratified K Fold Cross Validation and getting zero true negatives. Could you please let me know what options I have to get a value greater than zero for true negatives?

Answer 1:

There are quite a lot of strategies for dealing with imbalanced classes.

First, let's understand what is (probably) happening. You are asking your classifier to maximize accuracy: that is, the fraction of records that were correctly classified. If, say, 85% of the records are in Class A, then you will get 85% accuracy by just labelling everything as Class A. And this seems to be the best the classifier can achieve.
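As a minimal illustration of this effect (with made-up labels, 85% positive), a "classifier" that always predicts the majority class scores 85% accuracy while its confusion matrix shows zero true negatives, exactly the symptom described in the question:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy labels: 85% positives (1), 15% negatives (0)
y_true = np.array([1] * 85 + [0] * 15)

# A "classifier" that always predicts the majority class
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))   # 0.85
# Rows are true classes (0, 1); columns are predicted classes (0, 1)
print(confusion_matrix(y_true, y_pred))
# [[ 0 15]    <- zero true negatives, 15 false positives
#  [ 0 85]]   <- 85 true positives
```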

So, how can you correct for this?

1) You can try training your model on a balanced subset of your data. Randomly sample from the majority class only as many records as are present in the minority class. This won't allow your classifier to get away with labelling everything as the majority class, but it comes at the cost of having less information available to discover the structure of the class boundary (see the first sketch below).

2) Use a different optimization metric than accuracy. Popular choices are AUC or the F1 score (see the second sketch below).

3) Create an ensemble of classifiers using method 1. Each classifier will see a different subset of the data and 'vote' on a class, possibly with some confidence score. Each of these classifier outputs then becomes a feature for a final meta-classifier (possibly built using method 2). This way you get access to all of the information available (see the third sketch below).
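A minimal sketch of method 1, assuming `X` and `y` are NumPy arrays with 1 as the majority label (the `undersample` name and its parameters are just illustrative):

```python
import numpy as np

def undersample(X, y, majority_label=1, random_state=0):
    """Randomly keep only as many majority-class rows as there
    are minority-class rows, then shuffle the result."""
    rng = np.random.default_rng(random_state)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    keep = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    idx = np.concatenate([keep, minority_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

If you'd rather not hand-roll this, the imbalanced-learn package provides RandomUnderSampler (and smarter resampling variants such as SMOTE oversampling).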
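For method 2, scikit-learn's cross-validation utilities accept a `scoring` argument, so you can evaluate folds by F1 or AUC instead of accuracy. A sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data for illustration: class 0 is only 15% of the records
X, y = make_classification(n_samples=1000, weights=[0.15, 0.85],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Score folds by F1 or AUC instead of accuracy
print(cross_val_score(clf, X, y, cv=cv, scoring='f1').mean())
print(cross_val_score(clf, X, y, cv=cv, scoring='roc_auc').mean())
```

Relatedly, many scikit-learn estimators accept `class_weight='balanced'`, which penalizes mistakes on the minority class more heavily without discarding any data.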
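And a sketch of method 3, stacking members that were each trained on their own balanced subsample; the function names and the choice of decision trees as members are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_ensemble(X, y, n_members=5, seed=0):
    """Train each member on its own balanced subsample (method 1),
    then stack the members' probability outputs with a logistic
    regression meta-classifier (method 2-style)."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    majority, minority = (pos, neg) if pos.size > neg.size else (neg, pos)
    members = []
    for _ in range(n_members):
        keep = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([keep, minority])
        members.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    # Each member's estimated P(y=1) becomes one meta-feature
    meta_X = np.column_stack([m.predict_proba(X)[:, 1] for m in members])
    meta = LogisticRegression().fit(meta_X, y)
    return members, meta

def ensemble_predict(members, meta, X):
    meta_X = np.column_stack([m.predict_proba(X)[:, 1] for m in members])
    return meta.predict(meta_X)
```

Note that a production version should fit the meta-classifier on held-out (out-of-fold) member predictions to avoid leakage; imbalanced-learn's BalancedBaggingClassifier packages a closely related approach.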

This is far from an exhaustive list of solutions. Working with imbalanced (or 'skewed') datasets could fill an entire textbook. I would recommend reading some papers on this topic, perhaps starting here.