Sklearn Chi2 For Feature Selection

Posted 2019-06-26 08:19

Question:

I'm learning about chi2 for feature selection and came across code like this

However, my understanding of chi2 was that a higher score means the feature is more independent of the target (and therefore less useful to the model), so we would be interested in the features with the lowest scores. Yet scikit-learn's SelectKBest returns the features with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than a chi2 statistic?

See code below for what I mean (mostly copied from above link except for the end)

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
chi2_scores

# You can see that the kbest returned from SelectKBest
# were the two features with the _highest_ score
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest

Answer 1:

Your understanding is reversed.

The null hypothesis of the chi2 test is that the two categorical variables are independent. A higher chi2 statistic is stronger evidence against that hypothesis, meaning the feature and the target are dependent, so the feature is MORE USEFUL for classification.
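
To make the direction concrete, here is a minimal sketch of the textbook Pearson chi2 statistic on two made-up 2x2 contingency tables. The helper name chi2_statistic and the counts are hypothetical and chosen only for illustration. As far as I know, sklearn's chi2 builds its observed and expected values from per-class sums of the feature values rather than from a table like this, so treat it as an illustration of why a larger statistic means more dependence, not as a reimplementation of sklearn.

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are feature categories, columns are classes.
independent = [[25, 25], [25, 25]]  # feature tells us nothing about the class
dependent = [[45, 5], [5, 45]]      # feature almost determines the class

stat_independent = chi2_statistic(independent)
stat_dependent = chi2_statistic(dependent)
print(stat_independent, stat_dependent)
```

The table where the feature tells you nothing about the class scores zero, while the table where the feature nearly determines the class scores high. That is why a selector looking for useful features keeps the high scores.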

SelectKBest gives you the best two (k=2) features, ranked by highest chi2 value. So you should keep the features it returns, rather than taking the "other" features from the chi2 selector.

You are correct to get the chi2 statistic from chi2_selector.scores_ and the best features from chi2_selector.get_support(). It will give you 'petal length (cm)' and 'petal width (cm)' as the top 2 features based on the chi2 test of independence. Hope this clarifies the algorithm.
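
If you want to check this yourself, the sketch below refits the selector from the question and compares the features flagged by get_support() with the two highest entries in scores_. The variable names top_two_by_score and selected are mine, and it assumes no ties among the scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = iris.data.astype(int)  # same integer conversion as in the question
y = iris.target

chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Indices of the two largest chi2 scores
top_two_by_score = set(np.argsort(chi2_selector.scores_)[-2:].tolist())

# Indices of the features the selector actually kept
selected = set(np.flatnonzero(chi2_selector.get_support()).tolist())

# Print each feature with its score, p-value and whether it was kept
for name, score, pval, kept in zip(iris.feature_names,
                                   chi2_selector.scores_,
                                   chi2_selector.pvalues_,
                                   chi2_selector.get_support()):
    print(f"{name}: score={score:.2f}, pval={pval:.4g}, selected={kept}")
```

The kept features should also be the ones with the smallest p-values, since for a fixed number of classes a larger chi2 statistic corresponds to a smaller p-value.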