I'm an R user looking to get more comfortable with Python. I wrote a kind of mini-API that makes it easy to compare different statistical models fitted to the same data, in such a way that I can pre-set all the model hyperparameters and then iterate over the different models in order to fit them.
This is the essence of what I want to do:
- Build a wrapper class `Classifier` around a Scikit-learn `Pipeline`, in turn built on one of Scikit-learn's built-in estimators, e.g. `RandomForestClassifier`
- Create a dictionary of these un-fitted `Classifier`s, and a different dictionary of parameters to loop over
- Iterate over both dictionaries, have each un-fitted `Classifier` generate a new instance of the underlying `Pipeline`, fit it using its `Pipeline.fit` method, and save the new, fitted `Pipeline` in a different dictionary
However, it seems that, instead of generating a new instance of the `Pipeline` in each iteration, the same instance of the `Pipeline` (or maybe the underlying estimator) is being refitted. This is a problem because the `Pipeline.fit` method modifies the `Pipeline` (and underlying estimator) in place, so the fitted results from the previous iterations are all overwritten by the fitted results from the final iteration.
The problem is that I can't figure out where this "parent instance" is being created and how it's being referenced.
The basic setup with a reproducible example of the problem is in this Gist (it's a little too long to just copy and paste here). I added a print statement at the end to illustrate the issue.
Sorry if this is a little vague, but I'm not having an easy time describing it. Hopefully the issue is clear from the example.
The problem is that `results['0']['rf']` and `results['1']['rf']` are in fact the same object. Therefore, when you fit the pipeline in your loop, you are re-fitting an already fitted pipeline, losing your previous work.
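To see the aliasing in isolation, here is a minimal sketch. `TinyModel` is a hypothetical stand-in for your `Pipeline` (it is not part of scikit-learn or your Gist), used only so the snippet runs without any dependencies. Like `Pipeline.fit`, its `fit` method modifies the object in place and returns that same object. The `datasets` dictionary and its keys are also assumptions made for illustration.

```python
class TinyModel:
    """Hypothetical stand-in for a Pipeline: fit() mutates the object in place."""

    def __init__(self):
        self.mean_ = None  # learned state, set by fit()

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self  # returns the same object, not a copy


datasets = {"0": [1.0, 2.0, 3.0], "1": [10.0, 20.0, 30.0]}

# Buggy pattern: a single shared instance is refitted on every iteration.
shared = TinyModel()
results = {}
for key in datasets:
    results[key] = {"rf": shared.fit(datasets[key])}

# Both entries point at one object, so the first fit has been overwritten.
same_object = results["0"]["rf"] is results["1"]["rf"]
print(same_object)                # True
print(results["0"]["rf"].mean_)   # 20.0 (the mean of the *second* dataset)
```

Checking `a is b` on two entries of your own `results` dictionary is a quick way to confirm whether the same thing is happening in your code.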
To remedy this, you need to create a new instance of `Classifier` every time you fit it. One possible way to do this is to change your `classifiers` dictionary from one containing `Classifier` instances to one containing the arguments required to create a `Classifier`.
Now, in your loop, you should use a Python idiom known as "tuple unpacking" to unpack the arguments and create a separate `Classifier` instance for each combination.
Note that to iterate over the keys of a dictionary, one can simply write `for key in dct:`, as opposed to `for key in dct.keys()`.
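Putting those pieces together, the approach might look roughly like the sketch below. I don't know the exact signature of your `Classifier`, so the class here is a hypothetical version that takes an estimator class and a dictionary of hyperparameters. The `StandardScaler` step, the toy data from `make_classification`, and the `datasets` dictionary standing in for whatever you loop over are likewise assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Classifier:
    """Hypothetical wrapper; the real class in the Gist may differ."""

    def __init__(self, estimator_cls, params):
        # A brand-new estimator and Pipeline are built for every instance.
        self.pipeline = Pipeline([
            ("scale", StandardScaler()),
            ("model", estimator_cls(**params)),
        ])

    def fit(self, X, y):
        self.pipeline.fit(X, y)
        return self.pipeline


# Store the constructor arguments as tuples, not Classifier instances.
classifiers = {
    "rf": (RandomForestClassifier, {"n_estimators": 10, "random_state": 0}),
    "lr": (LogisticRegression, {"max_iter": 200}),
}

# Toy data standing in for whatever the outer loop iterates over.
datasets = {
    "0": make_classification(n_samples=60, n_features=5, random_state=0),
    "1": make_classification(n_samples=60, n_features=5, random_state=1),
}

results = {}
for data_key in datasets:          # iterating a dict yields its keys
    X, y = datasets[data_key]      # tuple unpacking
    results[data_key] = {}
    for name in classifiers:
        estimator_cls, params = classifiers[name]  # tuple unpacking
        # A fresh Classifier (and Pipeline) for each combination.
        results[data_key][name] = Classifier(estimator_cls, params).fit(X, y)

print(results["0"]["rf"] is results["1"]["rf"])  # False
```

Because each `Classifier` is constructed inside the loop, every entry in `results` holds its own fitted `Pipeline`, and later fits no longer overwrite earlier ones.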