-->

keras image preprocessing unbalanced data

2020-08-15 10:26发布

问题:

All,

I'm trying to use Keras to do image classification on two classes. For one class, I have very limited number of images, say 500. As for the other class, I have almost infinite number of images. So if I want to use keras image preprocessing, how to do that? Ideally, I need something like this. For class one, I feed 500 images and use ImageDataGenerator to get more images. For class two, each time I extract 500 images in sequence from 1000000 image dataset and probably no data augmentation needed. While looking at the example here and also Keras documentation, I found the training folder contains equal number of images for each class by default. So my question is that is there existing APIs for doing this trick? If so, please kindly point it out to me. If not, is there any workaround to this needs?

回答1:

You have some options.

Option 1

Use the class_weight parameter of the fit() function which is a dictionary mapping classes to a weight value. Lets say you have 500 samples of class 0 and 1500 samples of class 1 than you feed in class_weight = {0:3 , 1:1}. That gives class 0 three times the weight of class 1.

train_generator.classes gives you the proper class names for your weighting.

If you want to calculate this programmatically than you could use scikit-learn´s sklearn.utils.compute_class_weight(): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/class_weight.py

The function looks at the distribution of labels and produces weights to equally penalize under or over-represented classes in the training set.

See also this useful thread here: https://github.com/fchollet/keras/issues/1875

This thread might also be of help: Is it possible to automatically infer the class_weight from flow_from_directory in Keras?

Option 2

You use a dummy training run with a generator where you apply your image augmentation like rotation, scaling, cropping, flipping etc. and save the augmented images for the real training later. By that you can create a bigger or even balanced dataset for your underrepresented class.

In this dummy run you set save_to_dir in the flow_from_directory function to a folder of your choosing and later on only take the images from the class that you need more samples of. You obviously discard any training results since you only use this run to get more data.