What's wrong here? accidentally referencing an

Posted 2019-05-27 00:08

Question:

I'm an R user looking to get more comfortable with Python. I wrote a kind of mini-API that makes it easy to compare different statistical models fitted to the same data, in such a way that I can pre-set all the model hyperparameters and then iterate over the different models in order to fit them.

This is the essence of what I want to do:

  1. Build a wrapper class Classifier around a Scikit-learn Pipeline, in turn built on one of Scikit-learn's built-in estimators, e.g. RandomForestClassifier
  2. Create a dictionary of these un-fitted Classifiers, and a different dictionary of parameters to loop over
  3. Iterate over both dictionaries, have each un-fitted Classifier generate a new instance of the underlying Pipeline, fit it using its Pipeline.fit method, and save the new, fitted Pipeline in a different dictionary

However, it seems that, instead of generating a new instance of the Pipeline, in each iteration, the same instance of the Pipeline (or maybe the underlying estimator) is being refitted. This is a problem because the Pipeline.fit method modifies the Pipeline (and underlying estimator) in place, so the fitted results from the previous iterations are all overwritten by the fitted results from the final iteration.

The problem is that I can't figure out where this "parent instance" is being created and how it's being referenced.

The basic setup with a reproducible example of the problem is in this Gist (it's a little too long to just copy and paste here). I added a print statement at the end to illustrate the issue.

Sorry if this is a little vague, but I'm not having an easy time describing it. Hopefully the issue is clear from the example.

Answer 1:

The problem is that results['0']['rf'] and results['1']['rf'] are in fact the same object. Therefore, when you fit the pipeline in your loop:

results = dict()
for k in features.keys():
    results[k] = dict()
    for m in classifiers.keys():
        print(len(features[k]))
        results[k][m] = classifiers[m].fit(features[k], 'species', iris)

You are re-fitting an already-fitted pipeline, so each new fit overwrites the results of the previous one.
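
The aliasing can be reproduced without scikit-learn. The sketch below is only an illustration: ToyModel is a hypothetical stand-in, not part of the Gist, and it assumes nothing beyond the scikit-learn convention that fit modifies the object in place and returns it.

```python
class ToyModel:
    """Hypothetical stand-in for an estimator whose fit() mutates and returns self."""

    def __init__(self):
        self.n_rows_ = None  # fitted state, set by fit()

    def fit(self, rows):
        self.n_rows_ = len(rows)  # overwrites any earlier fitted state
        return self  # same convention as scikit-learn estimators


model = ToyModel()  # one shared instance, like an entry in the classifiers dict
features = {"0": [1, 2, 3], "1": [1, 2, 3, 4, 5]}

results = {}
for key, rows in features.items():
    results[key] = model.fit(rows)  # stores a reference, not a copy

print(results["0"] is results["1"])  # True: both keys point at the same object
print(results["0"].n_rows_)          # 5: the first fit has been overwritten
```

Both dictionary entries report the state of the last fit because there is only one object behind them.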

To remedy this, you need to create a new instance of Classifier every time you fit it. One possible way to do this is to change your classifiers dictionary from one containing Classifier instances to one containing the arguments required to create a Classifier:

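classifiers = {
    # (estimator class, keyword arguments for that estimator)
    'rf': (RandomForestClassifier, dict(n_estimators=100, oob_score=True, bootstrap=True)),
    'ab': (AdaBoostClassifier, dict(n_estimators=50)),
}

Now, in your loop, use the Python idiom known as "tuple unpacking" to split each entry into the estimator class and its parameters, and use ** to pass the parameters as keyword arguments, so that a separate Classifier instance is created for each combination. This assumes that the Classifier constructor takes the estimator class followed by the estimator's keyword arguments; adjust the call if yours differs.

for k in features:
    results[k] = dict()
    for m in classifiers:
        print(len(features[k]))
        estimator, params = classifiers[m]
        classifier = Classifier(estimator, **params)
        results[k][m] = classifier.fit(features[k], 'species', iris)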
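
The same idea, storing a recipe (a class plus its arguments) instead of an instance, can be sketched in isolation. Everything here is hypothetical and only for illustration: ToyModel, its scale parameter, and the specs dictionary are not from the Gist.

```python
class ToyModel:
    """Hypothetical estimator: fit() mutates the object in place and returns it."""

    def __init__(self, scale=1):
        self.scale = scale
        self.total_ = None  # fitted state, set by fit()

    def fit(self, rows):
        self.total_ = self.scale * sum(rows)
        return self


# Each entry holds what is needed to build a model, not a model itself.
specs = {
    "a": (ToyModel, {"scale": 1}),
    "b": (ToyModel, {"scale": 10}),
}
features = {"0": [1, 2, 3], "1": [4, 5]}

results = {}
for k, rows in features.items():
    results[k] = {}
    for name, (model_cls, params) in specs.items():
        # A fresh instance per (feature set, model) pair, so nothing is shared.
        results[k][name] = model_cls(**params).fit(rows)

print(results["0"]["a"] is results["1"]["a"])  # False: separate objects
print(results["0"]["a"].total_, results["1"]["a"].total_)  # 6 9
```

Because every fit runs on its own instance, the results of earlier iterations survive later ones. With real scikit-learn estimators, sklearn.base.clone is another way to obtain an unfitted copy with the same parameters.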

Note that to iterate over the keys of a dictionary, one can simply write for key in dct:, as opposed to for key in dct.keys().
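
A quick check of that equivalence, using a made-up dictionary (the name dct and its values are illustrative only):

```python
dct = {"rf": 100, "ab": 50}

implicit = [key for key in dct]         # iterating a dict yields its keys
explicit = [key for key in dct.keys()]  # same keys, same order

print(implicit == explicit)  # True

# When the values are needed as well, .items() avoids a second lookup.
for key, value in dct.items():
    print(key, value)
```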