Neural Network - Working with an imbalanced dataset

Posted 2019-05-28 21:36

I am working on a classification problem with 2 labels: 0 and 1. My training dataset is very imbalanced (and so will the test set be, given my problem).

The imbalance ratio is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million. So I should get around 100,000 samples for label '1'.

Given the large number of training samples, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether a neural network could handle this kind of imbalance with such a large dataset?

Also, as I am using TensorFlow to design the model, which characteristics should/could I tune to handle this imbalanced situation?

Thanks for your help! Paul


Update:

Considering the number of answers, and that they are quite similar, I will respond to all of them here in one common reply.

1) I tried the 1st option this weekend, increasing the cost for the positive label. With a less imbalanced proportion (like 1/10, on another dataset), this seems to help a bit, or at least to shift the precision/recall trade-off. However, for my situation, it seems very sensitive to the alpha value. With alpha = 250, which is the imbalance ratio of the dataset, I get a precision of 0.006 and a recall of 0.83, but the model predicts far more 1s than it should: around 50% of predictions are label '1'. With alpha = 100, the model predicts only '0'. I guess I'll have to do some tuning of this alpha parameter :/ I'll also take a look at tf.nn.weighted_cross_entropy_with_logits from TF, as I implemented the weighting manually for now.

2) I will try to rebalance the dataset, but I am afraid I will lose a lot of information doing that, as I have millions of samples but only ~100k positive ones.

3) Using a smaller batch size does seem a good idea. I'll try it!

4 Answers
来,给爷笑一个
Answer 2 · 2019-05-28 21:46

I will expand a bit on chasep's answer. If you are using a neural network followed by softmax + cross-entropy or hinge loss, you can, as @chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say the dominant class is labelled negative (neg) and the other positive (pos) for the softmax (for hinge loss you could do exactly the same):

 L = L_{neg} + L_{pos}  =>  L = L_{neg} + \alpha \cdot L_{pos}

With \alpha greater than 1.

Which would translate in TensorFlow, for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0, 1], to something like:

cross_entropy_mean = -tf.reduce_mean(targets * tf.log(y_out) * tf.constant([alpha, 1.]))

What is more, digging a bit into the TensorFlow API, there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.

Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives. I would go with the first option as it is slightly easier to do with TF.
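The class-weighted loss above can be sketched in plain NumPy (rather than the TF graph API) to make the effect of alpha concrete; the function name and the [1, 0]/[0, 1] labelling follow this answer, everything else is illustrative:

```python
import numpy as np

def weighted_cross_entropy(targets, y_out, alpha):
    """Cross-entropy with the positive class up-weighted by alpha.

    targets and y_out are (batch, 2) arrays; column 0 is the positive
    class and column 1 the negative, matching the [1, 0] / [0, 1]
    labelling above. Illustrative sketch, not a TF API.
    """
    weights = np.array([alpha, 1.0])  # misclassifying a positive costs alpha x more
    per_example = -np.sum(targets * np.log(y_out) * weights, axis=1)
    return per_example.mean()

# Toy batch: one positive, one negative, both predicted correctly with p = 0.9
targets = np.array([[1., 0.], [0., 1.]])
y_out   = np.array([[0.9, 0.1], [0.1, 0.9]])

loss_unweighted = weighted_cross_entropy(targets, y_out, alpha=1.0)
loss_weighted   = weighted_cross_entropy(targets, y_out, alpha=250.0)
# with alpha = 250 the loss is dominated by the positive example's term
```

With alpha = 250 even a small residual error on the rare positive dominates the batch loss, which is consistent with the sensitivity to alpha reported in the question's update.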

老娘就宠你
Answer 3 · 2019-05-28 21:46

Yes - a neural network could help in your case. There are at least a few approaches to such a problem:

  1. Leave your dataset unchanged but decrease the batch size and the number of epochs. Apparently this can work better than keeping the batch size big. From my experience, at the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it starts adjusting itself to increase performance on the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
  2. Balance your dataset and adjust your scores during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k * original_percentage_of_class_k.
  3. You may reweight your classes in the cost function (as mentioned in one of the other answers). The important thing then is to also reweight your scores in your final answer.
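The Bayes adjustment in point 2 can be sketched as follows. Note the direction of the correction: training on a balanced set makes the effective class prior uniform, so recovering real-world scores means multiplying each class score by its original percentage and renormalising. The function and variable names here are illustrative:

```python
import numpy as np

def adjust_scores(balanced_scores, original_priors):
    """Rescale scores from a model trained on a balanced set back to the
    true class priors: score_k ~ model_score_k * original_prior_k.
    (The balanced-training prior is a constant and cancels when we
    renormalise.) Illustrative sketch, not a library API.
    """
    adjusted = balanced_scores * original_priors
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

# A model trained on a 50/50 balanced set outputs 0.6/0.4 for classes
# '0'/'1'; the real-world priors follow the question's 1000:4 ratio.
priors = np.array([1000., 4.]) / 1004.
scores = adjust_scores(np.array([0.6, 0.4]), priors)
# after adjustment, nearly all the mass goes to the dominant class '0'
```

This shows why the reweighting in point 3 must also be undone at prediction time: a score that looks close under balanced training can still be overwhelmingly one-sided once the true priors are applied.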
祖国的老花朵
Answer 4 · 2019-05-28 21:50

There are usually two common approaches to an imbalanced dataset:

  1. Online sampling, as mentioned above. In each iteration you sample a class-balanced batch from the training set.

  2. Re-weight the cost of the two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example, this is used in the paper Holistically-Nested Edge Detection.
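The online-sampling idea in option 1 can be sketched as follows (NumPy, with made-up index arrays standing in for the real 23M-sample training set; the rare class is sampled with replacement so every batch contains positives):

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_batch(pos_idx, neg_idx, batch_size, pos_fraction=0.5):
    """Sample a mini-batch with a fixed fraction of positives.

    pos_idx / neg_idx are index arrays into the full training set.
    The rare class is sampled with replacement so it can fill its
    quota in every batch. Illustrative sketch.
    """
    n_pos = int(batch_size * pos_fraction)
    pos = rng.choice(pos_idx, size=n_pos, replace=True)        # rare class: resample
    neg = rng.choice(neg_idx, size=batch_size - n_pos, replace=False)
    batch = np.concatenate([pos, neg])
    rng.shuffle(batch)  # avoid all-positives-first ordering within the batch
    return batch

# Toy split echoing the question's imbalance: indices 0..9 are positive,
# 10..2509 are negative.
pos_idx = np.arange(10)
neg_idx = np.arange(10, 2510)
batch = balanced_batch(pos_idx, neg_idx, batch_size=32)
# exactly half of each batch comes from the rare class
```

Each epoch then revisits the rare positives many times while seeing only a fraction of the negatives, which is the sampling counterpart of up-weighting the positive loss.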

Lonely孤独者°
Answer 5 · 2019-05-28 22:02

One thing I might try is weighting the samples differently when calculating the cost. For instance, divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a 1. This way the rarer samples have more of an impact. You could also simply try training without any changes and see if the net just happens to work. I would make sure to use a large batch size, though, so you always get at least one of the rare samples in each batch.
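A minimal sketch of this per-sample weighting, assuming a vector of per-example costs and integer labels (all names here are illustrative, not an API):

```python
import numpy as np

def per_sample_weighted_cost(costs, labels, neg_weight=1.0 / 250.0):
    """Scale each example's cost by its class: divide by 250 (the
    imbalance ratio) when the true label is 0, leave it unchanged
    when the true label is 1. Illustrative sketch; plug the same
    idea into any per-example loss before averaging.
    """
    weights = np.where(labels == 0, neg_weight, 1.0)
    return costs * weights

# Two examples with equal raw cost, one per class
costs = np.array([0.5, 0.5])
labels = np.array([0, 1])
weighted = per_sample_weighted_cost(costs, labels)
# the rare positive now contributes 250x more than the negative
```

Down-weighting the dominant class by the ratio is equivalent, up to a constant factor, to up-weighting the rare class by alpha = 250 as in the answer above.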
