Why does sklearn Pipeline call transform() so many times?

Posted 2019-06-05 22:21

Question:

After a lot of reading and inspecting the pipeline.fit() operation under different verbose param settings, I'm still confused why a pipeline of mine visits a certain step's transform method so many times.

Below is a trivial example pipeline, fit with GridSearchCV, using 3-fold cross-validation, but a param-grid with only one set of hyperparams. So I expected three runs through the pipeline. Both step1 and step2 have fit called three times, as expected, but each step has transform called several more times. Why is this? Minimal code example and log output below.

# library imports
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline

# Load toy data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Series(iris.target, name='y')

# Define a couple trivial pipeline steps
class mult_everything_by(TransformerMixin, BaseEstimator):

    def __init__(self, multiplier=2):
        self.multiplier = multiplier

    def fit(self, X, y=None):
        print("Fitting step 1")
        return self

    def transform(self, X, y=None):
        print("Transforming step 1")
        return X * self.multiplier

class do_nothing(TransformerMixin, BaseEstimator):

    def __init__(self, meaningless_param='hello'):
        self.meaningless_param = meaningless_param

    def fit(self, X, y=None):
        print("Fitting step 2")
        return self

    def transform(self, X, y=None):
        print("Transforming step 2")
        return X

# Define the steps in our Pipeline
pipeline_steps = [('step1', mult_everything_by()),
                  ('step2', do_nothing()), 
                  ('classifier', LogisticRegression()),
                  ]

pipeline = Pipeline(pipeline_steps)

# To keep this example super minimal, this param grid only has one set
# of hyperparams, so we are only fitting one type of model
param_grid = {'step1__multiplier': [2],   #,3],
              'step2__meaningless_param': ['hello']   #, 'howdy', 'goodbye']
              }

# Define model-search process/object
# (fit one model, 3-fits due to 3-fold cross-validation)
cv_model_search = GridSearchCV(pipeline, 
                               param_grid, 
                               cv = KFold(3),
                               refit=False, 
                               verbose = 0) 

# Fit all (1) models defined in our model-search object
cv_model_search.fit(X,y)

Output:

Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2

Answer 1:

Because you have used GridSearchCV with cv = KFold(3), which cross-validates your model. Here's what happens:

  1. It will split the data into two parts: train and test.
  2. For the training part, it will fit and transform each step of the pipeline (excluding the last, which is the classifier). That's why you are seeing fit step1, transform step1, fit step2, transform step2.
  3. It will fit the classifier on the transformed data (which is not printed in your output).
  4. Now comes the scoring part. Here we don't want to re-fit the steps again; we reuse what was learned during the previous fitting. So each step of the pipeline will only call transform(). That's the reason for Transforming step 1, Transforming step 2.

    They show up twice because, in the scikit-learn version used here, the default behaviour of GridSearchCV is to compute the score on both the training and the test data. This behaviour is governed by return_train_score. You can set return_train_score=False and you will only see them once per fold. (From scikit-learn 0.21 onward the default is already False.)

  5. This transformed test data will be used to predict the output from the classifier. (Again, no fitting on test, only predicting or transforming).

  6. The predicted values will be used to compare with the actual values to score the model.
  7. Steps 1-6 will be repeated 3 times (KFold(3)).
  8. Now have a look at your params:

    param_grid = {'step1__multiplier': [2],   #,3],
                  'step2__meaningless_param': ['hello']   #, 'howdy', 'goodbye']
                  }

    When expanded, this becomes only a single combination, i.e.:

    Combination1: 'step1__multiplier'=2, 'step2__meaningless_param' = 'hello'

    If you had provided more options (the ones you have commented out), more combinations would be possible, like:

    Combination1: 'step1__multiplier'=2, 'step2__meaningless_param' = 'hello'

    Combination2: 'step1__multiplier'=3, 'step2__meaningless_param' = 'hello'

    Combination3: 'step1__multiplier'=2, 'step2__meaningless_param' = 'howdy'

    and so on.

  9. Steps 1-7 will be repeated for each possible combination.

  10. The combination which gave the highest average score on the test folds of the cross-validation will be chosen to finally fit the model on the complete data (no division into train and test).
  11. But you have set refit=False, so the model will not be fitted again. Otherwise you would have seen one more output of

    Fitting step 1
    Transforming step 1
    Fitting step 2
    Transforming step 2
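If you want to check the counts from the steps above yourself, here is a rough sketch that tallies the calls instead of printing them. The names `calls`, `CountingStep` and `count_calls` are made up for this illustration, the pipeline has only one pass-through step, and the single-value grid on `classifier__C` is just a stand-in for your one-combination grid. It assumes the default single-process run (no `n_jobs`), since a module-level counter would not be shared across worker processes.

```python
from collections import Counter

from sklearn import datasets
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Hypothetical helper: a module-level tally of method calls.
# GridSearchCV clones the pipeline for every fold, so counters stored
# on the instance would be lost; a shared Counter survives cloning.
calls = Counter()


class CountingStep(TransformerMixin, BaseEstimator):
    """Pass-through transformer that records fit/transform calls."""

    def fit(self, X, y=None):
        calls["fit"] += 1
        return self

    def transform(self, X):
        calls["transform"] += 1
        return X


def count_calls(return_train_score, refit):
    """Run a one-combination grid search and return the call tally."""
    calls.clear()
    X, y = datasets.load_iris(return_X_y=True)
    pipeline = Pipeline([
        ("step", CountingStep()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    search = GridSearchCV(
        pipeline,
        param_grid={"classifier__C": [1.0]},  # a single combination
        cv=KFold(3, shuffle=True, random_state=0),
        return_train_score=return_train_score,
        refit=refit,
    )
    search.fit(X, y)
    return dict(calls)


print(count_calls(return_train_score=True, refit=False))
print(count_calls(return_train_score=False, refit=False))
print(count_calls(return_train_score=False, refit=True))
```

Per fold you get one fit and one transform while fitting on the training part, plus one transform for each scoring pass. So the three runs should give 3 fits with 9 transforms, 3 fits with 6 transforms, and (with the final refit on all the data) 4 fits with 7 transforms.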

Hope this clears things up. Feel free to ask if you need any more info.
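As a footnote to the part about parameter combinations: a small sketch, assuming you restore the values you commented out in your grid, of how scikit-learn's `ParameterGrid` expands it. The variable names `full_grid`, `combinations` and `n_folds` are only for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the question with the commented-out values restored
# (an assumption about what the full grid would look like).
full_grid = {
    "step1__multiplier": [2, 3],
    "step2__meaningless_param": ["hello", "howdy", "goodbye"],
}

# ParameterGrid yields one dict per combination of values.
combinations = list(ParameterGrid(full_grid))
for params in combinations:
    print(params)

# 2 multipliers x 3 strings = 6 combinations; each one is
# cross-validated on every fold.
n_folds = 3
print(len(combinations), "combinations,",
      len(combinations) * n_folds, "cross-validation fits")
```

With 6 combinations and 3 folds the whole fit/transform/score sequence from the list above would run 18 times, plus once more for the final refit if refit=True.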