Feature selection for multilabel classification (s

2019-04-13 06:16发布

I'm trying to do a feature selection by chi-square method in scikit-learn (sklearn.feature_selection.SelectKBest). When I'm trying to apply this to a multilabel problem, I get this warning:

UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, or you used a classification score for a regression task. warn("Duplicate scores. Result may depend on feature ordering."

Why is it appearning and how to properly apply feature selection is this case?

标签： scikit-learn feature-selection chi-squared

2条回答

我只想做你的唯一

2楼-- · 2019-04-13 06:31

The code warns you that arbitrary tie-breaking may need to be performed because some features have exactly the same score.

That said, feature selection does not actually work for multilabel out of the box; the best you can currently do is tie feature selection and a classifier together in a pipeline, then feed that to a multilabel meta-estimator. Example (untested):

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)

(This warning is, I believe, issued even when the tied features aren't actually the k'th and (k+1)'th, I think. It can usually be ignored safely.)

0人赞添加讨论(0) 举报

Rolldiameter

3楼-- · 2019-04-13 06:47

I know the topic is a bit old but the following is working for me:

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
            ('lasso', OneVsRestClassifier(LogisticRegression()))])

0人赞添加讨论(0) 举报

Feature selection for multilabel classification (s

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间