Oversampling: SMOTE for binary and categorical data

Posted 2019-05-07 00:13

Question:

I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical and continuous data. Is there a way to apply SMOTE to binary and categorical data?

Answer 1:

As of January 2018, this feature has not been implemented in Python. Following is a reference from the team. In fact, they are open to proposals if someone wants to implement it.

For those with an academic interest in this ongoing issue, the paper from Chawla & Bowyer addresses the problem of applying SMOTE to non-continuous features in section 6.1.

Update: This feature was implemented as of 21 October 2018. The feature request is now closed.



Answer 2:

As per the documentation, this is now possible with the use of SMOTENC. SMOTE-NC is capable of handling a mix of categorical and continuous features.

Here is the code from the documentation:

    from imblearn.over_sampling import SMOTENC
    smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
    X_resampled, y_resampled = smote_nc.fit_resample(X, y)



Answer 3:

As per the documentation, SMOTE doesn't support categorical data in Python yet, and it produces continuous outputs.

You can instead employ a workaround where you convert the categorical variables to integers and use SMOTE.

Then use np.round(X_train[categorical_variables]) to convert them back to the respective categorical values.