How to balance unbalanced classification 1:1 with

Posted 2020-07-13 06:32

Question:

I am doing binary classification, and my target class is currently distributed as follows: Bad: 3126, Good: 25038.

I want the number of Bad (minority) examples to equal the number of Good examples (1:1). That means Bad needs to increase roughly 8x (an extra 21912 SMOTEd instances) while the majority class (Good) stays unchanged. The code I have tried does not keep the number of Good examples constant.

Code I have tried:

Example 1:

library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=0, k=5, learner=NULL)

Example 1 output: Bad:25008 Good:0

Example 2:

smoted_data <- SMOTE(targetclass~., data, perc.over=700, k=5, learner=NULL)

Example 2 output: Bad: 25008 Good:43764

Example 3:

smoted_data <- SMOTE(targetclass~., data, perc.over=700, perc.under=100, k=5, learner=NULL)

Example 3 output: Bad: 25008 Good: 21882

Answer 1:

To achieve a 1:1 balance using SMOTE, you want to do this:

library(DMwR)
smoted_data <- SMOTE(targetclass~., data, perc.over=100)

I have to admit it doesn't seem obvious from the built-in documentation, but if you read the original documentation, it states:

The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively.

perc.over will typically be a number above 100. For each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created. If perc.over is a value below 100 then a single case will be generated for a randomly selected proportion (given by perc.over/100) of the cases belonging to the minority class on the original data set.

So when perc.over is 100, you are essentially creating 1 new example for each minority case (100/100 = 1).

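The default of perc.under is 200, and that is what you want to keep: two majority cases are then selected for every newly generated minority case, which matches the size of the doubled minority class.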

The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases.

prop.table(table(smoted_data$targetclass))
# returns 0.5  0.5
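
The documented rules can be turned into a quick size check. The following is a minimal sketch in base R, assuming SMOTE behaves as the quoted documentation describes and that perc.over is at least 100; smote_counts is a hypothetical helper, not part of DMwR. It reproduces the counts reported in the question as well as the 1:1 result above.

```r
# Hypothetical helper (not part of DMwR): expected class counts after SMOTE,
# derived from the documented meaning of perc.over and perc.under.
smote_counts <- function(n_min, perc.over = 200, perc.under = 200) {
  stopifnot(perc.over >= 100)  # the sketch only covers the usual case
  # perc.over/100 synthetic cases are created per original minority case
  new_min <- as.integer(perc.over / 100) * n_min
  c(Bad  = n_min + new_min,                          # originals + synthetic
    Good = as.integer(perc.under / 100 * new_min))   # relative to synthetic
}

smote_counts(3126, perc.over = 700, perc.under = 0)    # Bad 25008, Good 0
smote_counts(3126, perc.over = 700, perc.under = 200)  # Bad 25008, Good 43764
smote_counts(3126, perc.over = 700, perc.under = 100)  # Bad 25008, Good 21882
smote_counts(3126, perc.over = 100, perc.under = 200)  # Bad 6252,  Good 6252
```

The last call shows why the defaults balance the classes: the number of Good cases kept is tied to the number of synthetic Bad cases, not to the original number of Good cases.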


Answer 2:

You can try using the ROSE package in R.

A research article with an example is available here



Answer 3:

You should use a perc.under of 114.423, since (700/100) x 3126 x (114.423/100) = 25038.04.

Note, however, that SMOTE under-samples the majority class at random, so the new data set will contain duplicates in the majority class. That is, it will have 25038 Good samples, but not the same 25038 Good samples as the original data: some Good samples will be left out and others will appear more than once.
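
The effect is easy to see without running SMOTE at all. This is a small base R sketch, assuming (as described above) that the majority rows are drawn at random with replacement; the seed and variable names are only illustrative.

```r
set.seed(42)      # arbitrary seed, only so the sketch is reproducible
n_good <- 25038   # number of Good cases in the question's data

# Draw as many row indices as there are Good cases, with replacement.
drawn <- sample(seq_len(n_good), size = n_good, replace = TRUE)

length(drawn)                    # size of the sample: 25038
length(unique(drawn))            # distinct original rows that were picked
n_good - length(unique(drawn))   # original Good rows that were left out
```

When n rows are drawn with replacement from n rows, roughly 63% of the originals (1 - 1/e) are expected to appear at least once, so a sizeable share of the Good rows would be dropped and replaced by repeats.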



Answer 4:

I recommend the bimba package, which I am developing. It is not yet available on CRAN, but you can easily install it from GitHub.

You can find installation instructions on its GitHub page: https://github.com/RomeroBarata/bimba

The only restriction on the data for the use of the SMOTE function implemented in bimba is that the predictors must be numeric and the target must be both the last column of the data frame that holds your data and have only two values.

As long as your data abide by these restrictions, using the SMOTE function is easy:

library(bimba)
smoted_data <- SMOTE(data, perc_min = 50, k = 5)

Here perc_min specifies the desired percentage of the minority class after over-sampling (in this case, perc_min = 50 balances the classes). Note that the majority class is not under-sampled, unlike in the DMwR package.
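
As a rough check of what perc_min = 50 implies for the data in the question, here is a base R sketch. It assumes perc_min is the minority share of the final data set and that the majority class is left untouched; synthetic_needed is a hypothetical helper, not a bimba function.

```r
# Hypothetical helper (not part of bimba): number of synthetic minority cases
# needed so that the minority makes up perc_min percent of the final data,
# with the majority class kept as is.
synthetic_needed <- function(n_min, n_maj, perc_min = 50) {
  stopifnot(perc_min > 0, perc_min < 100)
  target_min <- perc_min / (100 - perc_min) * n_maj  # minority size to reach
  max(0, ceiling(target_min - n_min))                # never a negative count
}

synthetic_needed(n_min = 3126, n_maj = 25038, perc_min = 50)  # 21912
```

The result matches the 21912 extra Bad instances the question asks for, with all 25038 Good cases kept.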