Putting together sklearn pipeline+nested cross-val

I'm trying to figure out how to built a workflow for sklearn.neighbors.KNeighborsRegressor that includes:

normalize features
feature selection (best subset of 20 numeric features, no specific total)
cross-validates hyperparameter K in range 1 to 20
cross-validates model
uses RMSE as error metric

There's so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.

Besides sklearn.neighbors.KNeighborsRegressor, I think I need:

sklearn.pipeline.Pipeline  
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score

sklearn.feature_selection.selectKBest
OR
sklearn.feature_selection.SelectFromModel

Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try knn__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10), 
                         X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)

rmses = np.abs(scores)**(1/2)
avg_rmse = np.mean(rmses)
print(avg_rmse)

It doesn't seem to error out, but a few of my concerns are:

Did I perform the nested cross-validation properly so that my RMSE is unbiased?
If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
Is SelectKBest, f_classif the best option to use for selecting features for the KNeighborsRegressor model?
How can I see:
- which subset of features was selected as best
- which K was selected as best

Any help is greatly appreciated!

Your code seems okay.

For the scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV, I would do the same to make sure things run fine but the only way to test this is to remove the one of the two and see if the results change.

SelectKBest is a good approach but you can also use SelectFromModel or even other methods that you can find here

Finally, in order to get the best parameters and the features scores I modified a bit your code as follows:

import ...


pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try knn__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# changes here

grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")

grid.fit(X, y)

# get the best parameters and the best estimator
print("the best estimator is \n {} ".format(grid.best_estimator_))
print("the best parameters are \n {}".format(grid.best_params_))

# get the features scores rounded in 2 decimals
pip_steps = grid.best_estimator_.named_steps['kbest']

features_scores = ['%.2f' % elem for elem in pip_steps.scores_ ]
print("the features scores are \n {}".format(features_scores))

feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
print("the feature_pvalues is \n {} ".format(feature_scores_pvalues))

# create a tuple of feature names, scores and pvalues, name it "features_selected_tuple"

featurelist = ['age', 'weight']

features_selected_tuple=[(featurelist[i], features_scores[i], 
feature_scores_pvalues[i]) for i in pip_steps.get_support(indices=True)]

# Sort the tuple by score, in reverse order

features_selected_tuple = sorted(features_selected_tuple, key=lambda 
feature: float(feature[1]) , reverse=True)

# Print
print 'Selected Features, Scores, P-Values'
print features_selected_tuple

Results using my data:

the best estimator is
Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=18, p=2,
         weights='uniform'))])

the best parameters are
{'kbest__k': 2, 'regressor__n_neighbors': 18}

the features scores are
['8.98', '8.80']

the feature_pvalues is
['0.000', '0.000']

Selected Features, Scores, P-Values
[('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]