Question:
I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical, and continuous data. Is there a way to apply SMOTE to binary and categorical data?
Answer 1:
As of January 2018, this had not been implemented in Python. According to the development team, they are in fact open to proposals if someone wants to implement it.
For those with an academic interest in this issue, the SMOTE paper by Chawla, Bowyer, et al. addresses the problem of sampling non-continuous features in Section 6.1.
Update: This feature was implemented as of 21 October 2018, and the request is now closed.
Answer 2:
According to the documentation, this is now possible with SMOTENC. SMOTE-NC can handle a mix of categorical and continuous features.
Here is the code from the documentation:
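from imblearn.over_sampling import SMOTENC
# Columns 0 and 2 of X are categorical; the remaining columns are continuous.
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
# X is the feature matrix and y holds the class labels, both defined elsewhere.
X_resampled, y_resampled = smote_nc.fit_resample(X, y)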
Answer 3:
According to the documentation, SMOTE does not yet support categorical data in Python and produces continuous outputs.
As a workaround, you can convert the categorical variables to integers and then apply SMOTE.
Afterwards, use np.round(X_train[categorical_variables]) to round the synthetic values back to the integer codes of the respective categories.
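To see why the rounding step is needed, here is a small hand-rolled sketch of the interpolation that SMOTE performs between a minority sample and one of its neighbors. It is not imbalanced-learn's implementation; the rows, the color coding, and the helper name interpolate are hypothetical and used only for illustration.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a SMOTE-style synthetic point on the segment between two samples."""
    gap = rng.random()  # random position between the two points, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbor)]


# Hypothetical minority rows: [color_code, age], with 0=red, 1=green, 2=blue.
sample = [0, 30.0]
neighbor = [2, 40.0]
categorical_columns = [0]

rng = random.Random(0)
synthetic = interpolate(sample, neighbor, rng)
print(synthetic)  # the color code is generally fractional here

# Workaround: round the categorical columns back to valid integer codes.
repaired = list(synthetic)
for col in categorical_columns:
    repaired[col] = int(round(repaired[col]))
print(repaired)
```

Note that interpolating between red (0) and blue (2) can round to green (1), so this workaround implicitly treats the integer codes as ordered. That is reasonable for ordinal variables but questionable for purely nominal ones.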