I am new to scikit-learn and have two small issues with combining data scaling and grid search.
- Efficient scaler
Consider a cross-validation using K folds. Each time the model is trained on K-1 folds, I would like the data scaler (preprocessing.StandardScaler(), for instance) to be fit only on those K-1 folds and then applied to the remaining fold.
My impression is that the following code will fit the scaler on the entire dataset, so I would like to modify it to behave as described previously:
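from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
classifier = svm.SVC(C=1)
clf = make_pipeline(preprocessing.StandardScaler(), classifier)
# Pipeline parameters are addressed as <step name>__<parameter>; make_pipeline names the SVC step 'svc'.
tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)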
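To make the desired behaviour concrete, here is a minimal hand-written sketch of per-fold scaling for a single value of C. The iris dataset, the shuffling seed, and the variable names are assumptions used only for illustration; they are not part of my actual setup.

```python
import numpy as np
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import KFold

# Stand-in data (assumption: any numeric X, y would do here).
X, y = datasets.load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kfold.split(X):
    # Fit the scaler on the K-1 training folds only.
    scaler = preprocessing.StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    # Reuse the training-fold mean and variance on the held-out fold.
    X_val = scaler.transform(X[val_idx])

    classifier = svm.SVC(C=1).fit(X_train, y[train_idx])
    fold_scores.append(classifier.score(X_val, y[val_idx]))

mean_score = float(np.mean(fold_scores))
print(fold_scores, mean_score)
```

The held-out fold never contributes to the scaler's statistics here, which is the property I want the grid search to preserve for every candidate value of C.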
- Retrieve inner scaler fitting
With refit=True, once the grid search has finished, the model is refit on the entire dataset using the best parameters found. My understanding is that the whole pipeline is used again, so the scaler is also fit on the entire dataset. Ideally I would like to reuse that fitted scaler to scale my 'test' dataset. Is there a way to retrieve it directly from the GridSearchCV object?
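As a sketch of what I am hoping for: as far as I can tell, the refit pipeline is exposed as `best_estimator_`, and `make_pipeline` names each step after its lowercased class name (`standardscaler`, `svc`), so the grid key has to be written `svc__C`. The iris data and the train/test split below are assumptions for illustration only.

```python
import numpy as np
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in data and split (assumption, for illustration only).
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC())
tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5, refit=True)
my_grid_search.fit(X_train, y_train)

# The pipeline refit on all of X_train with the best parameters.
best_pipeline = my_grid_search.best_estimator_
# The scaler step inside that refit pipeline.
fitted_scaler = best_pipeline.named_steps['standardscaler']

# Scale the test set by hand with the retrieved scaler, then predict with the SVC step.
X_test_scaled = fitted_scaler.transform(X_test)
manual_pred = best_pipeline.named_steps['svc'].predict(X_test_scaled)

# Predicting through the grid search runs the same scaler and classifier.
direct_pred = my_grid_search.predict(X_test)
print(np.array_equal(manual_pred, direct_pred))
```

If this is right, calling `my_grid_search.predict(X_test)` on the raw test data would already apply the refit scaler, and retrieving it by hand would only be needed to inspect or reuse the scaled values.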