From the scikit-learn RFE documentation: the algorithm successively selects smaller sets of features, keeping only the features with the highest weights. Features with low weights are dropped, and this process repeats until the number of remaining features matches the number specified by the user (or half of the original number of features by default).
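The pruning loop described above can be sketched in plain Python. This is a hypothetical illustration of the idea, not scikit-learn's actual implementation; the helper `recursive_feature_elimination` and its toy weight function are made up for this example:

```python
# Sketch of the RFE idea: at each iteration the estimator assigns a weight
# to every remaining feature, and up to `step` lowest-weight features are
# dropped, until only `n_to_select` features remain.

def recursive_feature_elimination(weight_fn, n_features, n_to_select, step=1):
    """weight_fn(feature_indices) -> one weight per index in feature_indices."""
    remaining = list(range(n_features))
    while len(remaining) > n_to_select:
        weights = weight_fn(remaining)
        # Positions of remaining features, sorted from lowest to highest weight.
        order = sorted(range(len(remaining)), key=lambda i: weights[i])
        n_drop = min(step, len(remaining) - n_to_select)
        drop = {remaining[i] for i in order[:n_drop]}
        remaining = [f for f in remaining if f not in drop]
    return remaining

# Toy weight function: pretend a feature's importance equals its index,
# so the 12 highest-index features survive out of 25.
selected = recursive_feature_elimination(lambda feats: feats, 25, 12)
print(selected)
```

With `n_to_select=12` this mirrors RFE's default of keeping half of the 25 features.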
The RFECV docs indicate that the features are ranked with RFE and the optimal number of features is then selected via k-fold cross-validation (KFCV).
We have a set of 25 features in the code shown in the documentation example for RFECV:
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV, RFE
from sklearn.datasets import make_classification
import numpy as np

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")

# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
              scoring='accuracy')
rfecv.fit(X, y)

rfe = RFE(estimator=svc, step=1)
rfe.fit(X, y)

print('Original number of features is %s' % X.shape[1])
print("RFE final number of features : %d" % rfe.n_features_)
print("RFECV final number of features : %d" % rfecv.n_features_)
print('')

g_scores = rfecv.grid_scores_
indices = np.argsort(g_scores)[::-1]

print('Printing RFECV results:')
for f in range(X.shape[1]):
    print("%d. Number of features: %d; Grid_Score: %f"
          % (f + 1, indices[f] + 1, g_scores[indices[f]]))
Here is the output I get:
Original number of features is 25
RFE final number of features : 12
RFECV final number of features : 3
Printing RFECV results:
1. Number of features: 3; Grid_Score: 0.818041
2. Number of features: 4; Grid_Score: 0.816065
3. Number of features: 5; Grid_Score: 0.816053
4. Number of features: 6; Grid_Score: 0.799107
5. Number of features: 7; Grid_Score: 0.797047
6. Number of features: 8; Grid_Score: 0.783034
7. Number of features: 10; Grid_Score: 0.783022
8. Number of features: 9; Grid_Score: 0.781992
9. Number of features: 11; Grid_Score: 0.778028
10. Number of features: 12; Grid_Score: 0.774052
11. Number of features: 14; Grid_Score: 0.762015
12. Number of features: 13; Grid_Score: 0.760075
13. Number of features: 15; Grid_Score: 0.752003
14. Number of features: 16; Grid_Score: 0.750015
15. Number of features: 18; Grid_Score: 0.750003
16. Number of features: 22; Grid_Score: 0.748039
17. Number of features: 17; Grid_Score: 0.746003
18. Number of features: 19; Grid_Score: 0.739105
19. Number of features: 20; Grid_Score: 0.739021
20. Number of features: 21; Grid_Score: 0.738003
21. Number of features: 23; Grid_Score: 0.729068
22. Number of features: 25; Grid_Score: 0.725056
23. Number of features: 24; Grid_Score: 0.725044
24. Number of features: 2; Grid_Score: 0.506952
25. Number of features: 1; Grid_Score: 0.272896
In this particular example:
- For RFE: the code always returns 12 features (approx. half of 25, as expected from the documentation)
- For RFECV: the code returns a number anywhere from 1 to 25 (not half the number of features)
It seems to me that with RFECV, the number of features is being picked based solely on the KFCV scores - i.e. the cross-validation scores are overriding RFE's successive pruning of features.
Is this true? If one wants to use the native recursive feature elimination algorithm, is RFECV using that algorithm, or a hybrid version of it?
In RFECV, is the cross-validation being done on the subset of features remaining after pruning? If so, how many features are kept after each prune in RFECV?
In the cross-validated version, the features are re-ranked at each step and the lowest-ranked feature is dropped -- this is referred to as "recursive feature elimination" in the docs.
If you want to compare this to the naive version, you'll need to compute the cross-validated score for the features selected by RFE. My guess is that the RFECV answer is correct -- judging from the sharp increase in model performance as the number of features decreases, you probably have some highly correlated features that are harming your model's performance.
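That comparison could be sketched like this, assuming the modern `sklearn.model_selection` imports (the variable names are just for illustration):

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE
from sklearn.datasets import make_classification

# Same classification task as in the question.
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

svc = SVC(kernel="linear")
rfe = RFE(estimator=svc, step=1)  # keeps 25 // 2 = 12 features by default
rfe.fit(X, y)

# Cross-validated accuracy using only the features RFE kept, so it can be
# compared against RFECV's best mean CV score on the same footing.
score_rfe = cross_val_score(svc, X[:, rfe.support_], y, cv=2,
                            scoring='accuracy').mean()
print("CV accuracy with RFE's %d features: %.3f" % (rfe.n_features_, score_rfe))
```

If this number comes out below RFECV's best score, that supports RFECV's smaller feature subset being the better choice.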