I'm trying to figure out how to build a workflow for sklearn.neighbors.KNeighborsRegressor that includes:
- normalize features
- feature selection (best subset of 20 numeric features, no specific total)
- cross-validates hyperparameter K in range 1 to 20
- cross-validates model
- uses RMSE as error metric
There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.
Besides sklearn.neighbors.KNeighborsRegressor, I think I need:
sklearn.pipeline.Pipeline
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score
sklearn.feature_selection.SelectKBest
OR
sklearn.feature_selection.SelectFromModel
Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
('kbest', SelectKBest(f_classif)),
('regressor', KNeighborsRegressor())])
# try regressor__n_neighbors from 1 to 20, and kbest__k from 1 to the number of features
parameters = {'kbest__k': list(range(1, X.shape[1]+1)),
'regressor__n_neighbors': list(range(1,21))}
# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10),
X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)
rmses = np.sqrt(np.abs(scores))  # np.sqrt avoids 1/2 evaluating to 0 under Python 2 integer division
avg_rmse = np.mean(rmses)
print(avg_rmse)
It doesn't seem to error out, but a few of my concerns are:
- Did I perform the nested cross-validation properly so that my RMSE is unbiased?
- If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
- Is SelectKBest with f_classif the best option to use for selecting features for the KNeighborsRegressor model?
- How can I see:
- which subset of features was selected as best
- which K was selected as best
Any help is greatly appreciated!
Your code seems okay.
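One thing that may be worth double-checking is the 'normalize' step. Normalizer rescales each sample (row) to unit norm, whereas per-feature (column) scaling is done by classes such as StandardScaler. Below is a minimal sketch on a made-up matrix, only to show the difference; whether it matters for your data is something you would have to test:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy matrix (made up for illustration): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Normalizer rescales each ROW (sample) to unit L2 norm.
row_scaled = Normalizer().fit_transform(X_toy)

# StandardScaler rescales each COLUMN (feature) to zero mean and unit variance.
col_scaled = StandardScaler().fit_transform(X_toy)

print(np.linalg.norm(row_scaled, axis=1))               # per-row norms
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))  # per-column statistics
```

If per-feature scaling is what "normalize features" means in your workflow, swapping the first pipeline step is a one-line change.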
As for using scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV, I would do the same to make sure things run fine, but the only way to test this is to remove one of the two and see whether the results change.
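To make the relationship between the scorer and RMSE concrete, here is a small worked example. The fold scores are made-up numbers, not results from any real data. scikit-learn negates MSE so that higher is better, so you flip the sign and take the square root per fold. Since the square root is monotonic, the candidate with the lowest MSE on a given fold also has the lowest RMSE on that fold:

```python
import math

# Hypothetical per-fold scores, shaped like what
# scoring="neg_mean_squared_error" returns (values are made up).
neg_mse_scores = [-4.0, -9.0, -16.0]

# Flip the sign to recover MSE, then take the square root fold by fold.
fold_rmses = [math.sqrt(-s) for s in neg_mse_scores]
avg_rmse = sum(fold_rmses) / len(fold_rmses)

# Taking the root of the averaged MSE is a different quantity,
# and it is never smaller than the average of the per-fold RMSEs.
rmse_of_mean_mse = math.sqrt(-sum(neg_mse_scores) / len(neg_mse_scores))

print(fold_rmses)
print(avg_rmse)
print(rmse_of_mean_mse)
```

Recent scikit-learn releases also accept scoring="neg_root_mean_squared_error", which would give RMSE per fold directly, if your installed version supports it.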
SelectKBest is a good approach, but you can also use SelectFromModel or one of the other feature selection methods available in scikit-learn.
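One caveat on the score function: f_classif is an ANOVA F-test meant for class labels, and sklearn.feature_selection also provides f_regression (and mutual_info_regression) for continuous targets. Here is a minimal sketch with f_regression on synthetic data. The data and the choice of k=2 are assumptions for illustration, and this is one option rather than necessarily the best one for a KNN model:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only columns 0 and 2 drive y.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

# f_regression scores each feature with a univariate linear F-test
# against a continuous target.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

selected = selector.get_support(indices=True)
print(selected)          # indices of the columns that were kept
print(selector.scores_)  # one F-statistic per input column
```

Inside the pipeline this would just mean passing f_regression instead of f_classif to SelectKBest.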
Finally, to get the best parameters and the feature scores, I modified your code slightly as follows:
import ...
pipeline = Pipeline([('normalize', Normalizer()),
('kbest', SelectKBest(f_classif)),
('regressor', KNeighborsRegressor())])
# try regressor__n_neighbors from 1 to 20, and kbest__k from 1 to the number of features
parameters = {'kbest__k': list(range(1, X.shape[1]+1)),
'regressor__n_neighbors': list(range(1,21))}
# changes here
grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")
grid.fit(X, y)
# get the best parameters and the best estimator
print("the best estimator is \n {} ".format(grid.best_estimator_))
print("the best parameters are \n {}".format(grid.best_params_))
# get the feature scores rounded to 2 decimals
pip_steps = grid.best_estimator_.named_steps['kbest']
features_scores = ['%.2f' % elem for elem in pip_steps.scores_ ]
print("the features scores are \n {}".format(features_scores))
feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
print("the feature_pvalues is \n {} ".format(feature_scores_pvalues))
# build a list of (feature name, score, p-value) tuples for the selected features;
# featurelist must name every column of X, in column order
featurelist = ['age', 'weight']
features_selected_tuple = [(featurelist[i], features_scores[i], feature_scores_pvalues[i])
                           for i in pip_steps.get_support(indices=True)]
# Sort the list by score, in descending order
features_selected_tuple = sorted(features_selected_tuple,
                                 key=lambda feature: float(feature[1]), reverse=True)
# Print
print('Selected Features, Scores, P-Values')
print(features_selected_tuple)
Results using my data:
the best estimator is
Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=18, p=2,
weights='uniform'))])
the best parameters are
{'kbest__k': 2, 'regressor__n_neighbors': 18}
the features scores are
['8.98', '8.80']
the feature_pvalues is
['0.000', '0.000']
Selected Features, Scores, P-Values
[('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]
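Note that the code above fits a single GridSearchCV on all the data, so it reports one set of best parameters. In the nested setup from the question, each outer fold runs its own search and may pick a different K and feature subset. Here is a hedged sketch of how you could look at those per-fold choices using cross_validate with return_estimator=True. The synthetic data, the small grid, the fold counts, and the use of f_regression in place of f_classif are all assumptions made so the example is self-contained and quick to run:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Synthetic stand-in for the real X, y (an assumption for illustration).
X_syn, y_syn = make_regression(n_samples=120, n_features=6, n_informative=3,
                               noise=5.0, random_state=0)

pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_regression)),
                     ('regressor', KNeighborsRegressor())])

# Deliberately small grid and few folds so the sketch runs quickly.
parameters = {'kbest__k': [2, 4, 6],
              'regressor__n_neighbors': [1, 5, 10]}
inner_search = GridSearchCV(pipeline, parameters, cv=3,
                            scoring='neg_mean_squared_error')

# The outer loop refits the whole search on each outer training split
# and keeps the fitted search objects.
outer = cross_validate(inner_search, X_syn, y_syn, cv=4,
                       scoring='neg_mean_squared_error',
                       return_estimator=True)

fold_rmses = np.sqrt(-outer['test_score'])
for rmse, search in zip(fold_rmses, outer['estimator']):
    kbest = search.best_estimator_.named_steps['kbest']
    kept = kbest.get_support(indices=True)
    print(f"RMSE={rmse:.2f}", search.best_params_, kept)
```

The outer scores estimate how well the whole tuning procedure generalizes; the model you actually keep would still come from fitting the grid search once on all your data, as in the code above.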