After a lot of reading and inspecting the pipeline.fit() operation under different verbose
param settings, I'm still confused about why a pipeline of mine calls a certain step's transform
method so many times.
Below is a trivial example pipeline, fit with GridSearchCV using 3-fold cross-validation but a param grid with only one set of hyperparams, so I expected three runs through the pipeline. Both step1 and step2 have fit called three times, as expected, but each step has transform called nine times (three per fold). Why is this? A minimal code example and the log output are below.
# library imports
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline
# Load toy data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Series(iris.target, name='y')
# Define a couple of trivial pipeline steps
class mult_everything_by(TransformerMixin, BaseEstimator):
    def __init__(self, multiplier=2):
        self.multiplier = multiplier

    def fit(self, X, y=None):
        print("Fitting step 1")
        return self

    def transform(self, X, y=None):
        print("Transforming step 1")
        return X * self.multiplier


class do_nothing(TransformerMixin, BaseEstimator):
    def __init__(self, meaningless_param='hello'):
        self.meaningless_param = meaningless_param

    def fit(self, X, y=None):
        print("Fitting step 2")
        return self

    def transform(self, X, y=None):
        print("Transforming step 2")
        return X
# Define the steps in our Pipeline
pipeline_steps = [('step1', mult_everything_by()),
                  ('step2', do_nothing()),
                  ('classifier', LogisticRegression()),
                  ]
pipeline = Pipeline(pipeline_steps)
# To keep this example super minimal, this param grid only has one set
# of hyperparams, so we are only fitting one type of model
param_grid = {'step1__multiplier': [2],  # ,3],
              'step2__meaningless_param': ['hello']  # , 'howdy', 'goodbye']
              }
# Define model-search process/object
# (fit one model, 3-fits due to 3-fold cross-validation)
cv_model_search = GridSearchCV(pipeline,
                               param_grid,
                               cv=KFold(3),
                               refit=False,
                               verbose=0)
# Fit all (1) models defined in our model-search object
cv_model_search.fit(X,y)
Output:
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
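
One way to narrow this down is to count the calls instead of reading the log. Each fold in the log shows one transform right after fit (the fit_transform on the training fold) followed by two more, which would be consistent with the fitted pipeline being scored twice per fold, for example once on the held-out fold and once on the training fold. The sketch below tests that guess by toggling GridSearchCV's return_train_score parameter. CountingStep, calls and count_calls are hypothetical names made up for this illustration, and whether train scores are computed by default depends on the installed scikit-learn version, so treat this as a diagnostic under those assumptions rather than a confirmed explanation.

```python
from collections import Counter

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Module-level tally. GridSearchCV clones the pipeline for every fold,
# so counters stored on a transformer instance would be lost.
calls = Counter()


class CountingStep(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that tallies its calls."""

    def fit(self, X, y=None):
        calls["fit"] += 1
        return self

    def transform(self, X):
        calls["transform"] += 1
        return X


def count_calls(return_train_score):
    """Run a one-candidate, 3-fold grid search and return the call tally."""
    calls.clear()
    X, y = load_iris(return_X_y=True)
    pipe = Pipeline([
        ("step", CountingStep()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    search = GridSearchCV(
        pipe,
        param_grid={"classifier__C": [1.0]},  # a single candidate
        cv=KFold(3),
        refit=False,
        return_train_score=return_train_score,
    )
    # Default n_jobs runs in-process, so the global tally sees every call.
    search.fit(X, y)
    return dict(calls)


if __name__ == "__main__":
    print("test score only:    ", count_calls(False))
    print("train + test scores:", count_calls(True))
```

If the transform count drops from nine to six when return_train_score is False, the third transform per fold in the log above comes from scoring on the training fold, and the second from scoring on the held-out fold.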