This is probably a simple question, but I am trying to calculate p-values for my features, using either classifiers for a classification problem or regressors for a regression problem. Could someone suggest the best method for each case and provide sample code? I just want to see the p-value for each feature, rather than keeping the k best features or a percentile of features as explained in the documentation.
Thank you
Just run the significance test on X, y directly. Example using 20news and chi2:
>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> from sklearn.feature_selection import chi2
>>> data = fetch_20newsgroups_vectorized()
>>> X, y = data.data, data.target
>>> scores, pvalues = chi2(X, y)
>>> pvalues
array([ 4.10171798e-17, 4.34003018e-01, 9.99999996e-01, ...,
        9.99999995e-01, 9.99999869e-01, 9.99981414e-01])
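The same pattern should work with the other univariate tests in sklearn.feature_selection. chi2 expects non-negative features (counts or frequencies), so for general real-valued features a rough sketch could use f_classif for classification and f_regression for regression; both return a (scores, pvalues) pair. The synthetic data from make_classification and make_regression below is only an assumption for illustration, so substitute your own X, y:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.feature_selection import f_classif, f_regression

# Synthetic data, assumed here purely for illustration; use your own X, y.
X_clf, y_clf = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
# Classification: ANOVA F-test between each feature and the class labels.
F_clf, p_clf = f_classif(X_clf, y_clf)

X_reg, y_reg = make_regression(
    n_samples=200, n_features=5, n_informative=2, noise=1.0, random_state=0
)
# Regression: univariate linear F-test between each feature and the target.
F_reg, p_reg = f_regression(X_reg, y_reg)

# One p-value per feature, in column order.
for i, (pc, pr) in enumerate(zip(p_clf, p_reg)):
    print(f"feature {i}: f_classif p={pc:.3g}, f_regression p={pr:.3g}")
```

Note that these are univariate tests: each feature is tested against y on its own, so the p-values do not account for interactions or correlations between features.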