Using SMOTE with GridSearchCV in scikit-learn

Posted 2019-02-28 13:44

I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a stage of a pipeline and pass that to GridSearchCV. My concern is that SMOTE will then be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset splits? And if so, how can I work around this? Thanks a lot in advance.

Tags: machine-learning scikit-learn grid-search oversampling

1 Answer

一夜七次

#2 · 2019-02-28 14:33

Yes, it can be done, but you need imblearn's Pipeline rather than scikit-learn's.

You see, imblearn has its own Pipeline to handle the samplers correctly. I described this in a similar question here.

When predict() is called on an imblearn Pipeline object, it skips the sampling step and passes the data unchanged to the next transformer. You can confirm this by looking at the source code here:

        if hasattr(transform, "fit_sample"):
            pass
        else:
            Xt = transform.transform(Xt)
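
To make that branch concrete, here is a toy sketch using only the standard library. It is not imblearn's implementation: ToyOversampler, ToyMajorityClassifier and ToyPipeline are hypothetical stand-ins written for illustration, and the method name fit_sample simply mirrors the snippet above (newer imbalanced-learn releases call it fit_resample). The point is only that resampling happens inside fit() and is skipped inside predict().

```python
# Toy illustration only: hypothetical stand-ins, not imblearn's real classes.

class ToyOversampler:
    """Duplicates minority-class rows until all classes are the same size."""
    def __init__(self):
        self.calls = 0  # counts how often resampling actually ran

    def fit_sample(self, X, y):
        self.calls += 1
        counts = {label: y.count(label) for label in set(y)}
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, n in counts.items():
            rows = [x for x, lab in zip(X, y) if lab == label]
            for i in range(target - n):
                X_out.append(rows[i % len(rows)])
                y_out.append(label)
        return X_out, y_out


class ToyMajorityClassifier:
    """Predicts the most frequent label seen during fit."""
    def fit(self, X, y):
        self.n_train_ = len(y)  # number of rows the classifier was fitted on
        self.label_ = max(sorted(set(y)), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        Xt, yt = X, y
        for step in self.steps[:-1]:
            if hasattr(step, "fit_sample"):
                Xt, yt = step.fit_sample(Xt, yt)  # resample training data only
            else:
                Xt = step.fit(Xt, yt).transform(Xt)
        self.steps[-1].fit(Xt, yt)
        return self

    def predict(self, X):
        Xt = X
        for step in self.steps[:-1]:
            if hasattr(step, "fit_sample"):
                pass  # samplers are skipped at prediction time
            else:
                Xt = step.transform(Xt)
        return self.steps[-1].predict(Xt)


sampler = ToyOversampler()
clf = ToyMajorityClassifier()
pipe = ToyPipeline([sampler, clf])
pipe.fit([[0], [1], [2], [3]], [0, 0, 0, 1])  # 3 vs 1: imbalanced
preds = pipe.predict([[5], [6]])
print(clf.n_train_, len(preds), sampler.calls)
```

The classifier is fitted on six rows (the four originals plus two duplicated minority rows), while predict() returns exactly two predictions and the sampler has run only once. During cross-validation the same thing happens per fold: the training part is resampled and the held-out part is scored as it is.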

So for this to work correctly, you need the following:

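from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # not sklearn.pipeline.Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# SMOTE runs only while fitting, so validation folds are never oversampled
model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', LogisticRegression())
])

# params and the remaining arguments are left for you to fill in
grid = GridSearchCV(model, params, ...)
grid.fit(X, y)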

Fill in the details as necessary, and the pipeline will take care of the rest.
