I'm trying to figure out how to build a workflow for sklearn.neighbors.KNeighborsRegressor
that includes:
- feature normalization
- feature selection (the best subset of my 20 numeric features, with no specific number required)
- cross-validation of the hyperparameter K over the range 1 to 20
- cross-validation of the model
- RMSE as the error metric
There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.
Besides sklearn.neighbors.KNeighborsRegressor, I think I need:
sklearn.pipeline.Pipeline
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score
sklearn.feature_selection.SelectKBest
OR
sklearn.feature_selection.SelectFromModel
Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
('kbest', SelectKBest(f_classif)),
('regressor', KNeighborsRegressor())])
# try regressor__n_neighbors from 1 to 20, and the number of selected features from 1 to len(features)
parameters = {'kbest__k': list(range(1, X.shape[1]+1)),
'regressor__n_neighbors': list(range(1,21))}
# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10),
X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)
rmses = np.abs(scores)**(1/2)
avg_rmse = np.mean(rmses)
print(avg_rmse)
It doesn't seem to error out, but a few of my concerns are:
- Did I perform the nested cross-validation properly so that my RMSE is unbiased?
- If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
- Is SelectKBest with f_classif the best option for selecting features for the KNeighborsRegressor model?
- How can I see:
- which subset of features was selected as best
- which K was selected as best
Any help is greatly appreciated!
Your code seems okay.
Regarding using scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV: I would do the same to make sure things run consistently, but the only way to test this is to remove it from one of the two and see whether the results change.
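As a side note, if your scikit-learn version is recent enough (0.22 or later, I believe), there is also a built-in "neg_root_mean_squared_error" scorer, so you would not need to take the square root yourself. A sketch reusing the pipeline, parameters, X and y from your code above:

# use the RMSE scorer directly for both the inner search and the outer CV
inner_cv = GridSearchCV(pipeline, parameters, scoring="neg_root_mean_squared_error", cv=10)
scores = cross_val_score(inner_cv, X, y, cv=10, scoring="neg_root_mean_squared_error")
# scores are negated RMSE values, one per outer fold
avg_rmse = -scores.mean()
print(avg_rmse)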
SelectKBest is a good approach, but you can also use SelectFromModel or one of the other methods available in sklearn.feature_selection. Finally, in order to get the best parameters and the feature scores, I modified your code a bit, along the following lines:
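Reusing your pipeline, parameters, X and y from above, the core idea is to fit the GridSearchCV once and then read the chosen hyperparameters and the fitted SelectKBest step from best_estimator_ (a sketch of the idea rather than the exact code):

grid = GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10)
grid.fit(X, y)

# best hyperparameters found by the search (kbest__k and regressor__n_neighbors)
print(grid.best_params_)

# the fitted SelectKBest step of the winning pipeline
kbest = grid.best_estimator_.named_steps['kbest']

# indices of the feature columns that were kept
print(kbest.get_support(indices=True))

# univariate scores for every feature
print(kbest.scores_)

This answers both of your last questions: best_params_ gives the selected K (and k for SelectKBest), and get_support / scores_ show which features were kept and how they scored.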
Results using my data: