How to pickle individual steps in sklearn's Pi

Posted 2019-05-29 07:57

Question:

I am using Pipeline from sklearn to classify text.

In this example Pipeline, the steps are a FeatureUnion, which wraps a TfidfVectorizer and some custom features, followed by a classifier. I then fit the training data and run the prediction:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)

# CustomFeatures is a user-defined transformer (its definition is not shown here)
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures()),
    ])),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

# etc.

Here I need to pickle the TfidfVectorizer step and leave custom_features unpickled, since I am still experimenting with them. The idea is to make the pipeline faster by pickling the tfidf step.

I know I can pickle the whole Pipeline with joblib.dump, but how do I pickle individual steps?

Answer 1:

To pickle the TfidfVectorizer, you could use:

import joblib

joblib.dump(pipeline.steps[0][1].transformer_list[0][1], dump_path)

or:

joblib.dump(pipeline.get_params()['features__tfidf'], dump_path)
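The two expressions above should point at the very same fitted object, so either one can be passed to joblib.dump. The sketch below checks that under one assumption: the asker's CustomFeatures class is not shown, so a character-level TfidfVectorizer stands in for it purely to keep the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Assumption: a second vectorizer replaces the asker's CustomFeatures,
# whose definition is not part of the question.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', TfidfVectorizer(analyzer='char')),
    ])),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(['I am a sentence', 'an example'], [1, 2])

# Three ways to reach the fitted tfidf transformer.
by_index = pipeline.steps[0][1].transformer_list[0][1]
by_param = pipeline.get_params()['features__tfidf']
by_name = dict(pipeline.named_steps['features'].transformer_list)['tfidf']

# get_params returns references, not copies, so all three are one object.
same_object = (by_index is by_param) and (by_param is by_name)
print(same_object)
```

Looking the transformer up by name is arguably less fragile than positional indexing if the order of the union's transformers changes later.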

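To load the dumped object, replace the whole (name, transformer) entry, because the entries of transformer_list are tuples and cannot be modified in place:

pipeline.steps[0][1].transformer_list[0] = ('tfidf', joblib.load(dump_path))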
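As a rough end-to-end sketch of the dump-and-reload round trip, the example below fits the pipeline, dumps only the tfidf transformer, loads it back, and swaps it into the FeatureUnion. Two things are assumptions made for illustration: the CustomFeatures class is a hypothetical stand-in (one feature, the text length), and the dump path is a temporary file. Because the entries of transformer_list are (name, transformer) tuples, the sketch replaces the whole tuple instead of assigning to its second element.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class CustomFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the asker's class: text length as one feature."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)


X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures()),
    ])),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(X, Y)
pred_before = pipeline.predict(X_dev)

# Hypothetical dump location; any writable path would do.
dump_path = os.path.join(tempfile.mkdtemp(), 'tfidf.joblib')
fitted_tfidf = pipeline.get_params()['features__tfidf']
joblib.dump(fitted_tfidf, dump_path)

# Swap in the reloaded vectorizer by replacing the whole tuple.
loaded_tfidf = joblib.load(dump_path)
features = pipeline.named_steps['features']
features.transformer_list[0] = ('tfidf', loaded_tfidf)

pred_after = pipeline.predict(X_dev)
print(pred_before, pred_after)
```

One caveat: calling pipeline.fit again refits every step, including the reloaded vectorizer, so the pickled object only saves work for transform and predict calls.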

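When this answer was written, you could not use set_params, the inverse of get_params, to insert the estimator by name; that depended on the changes in PR #1769 ("enable setting pipeline components as parameters") being merged. More recent scikit-learn releases do accept a component name in set_params, so pipeline.set_params(features__tfidf=joblib.load(dump_path)) can replace the positional assignment.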
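In recent scikit-learn releases, replacing a component by name through set_params does work, which avoids positional indexing altogether. The sketch below assumes such a release is installed; the character-level vectorizer standing in for CustomFeatures and the temporary dump path are both hypothetical choices made only for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Assumption: a second vectorizer replaces the asker's CustomFeatures.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', TfidfVectorizer(analyzer='char')),
    ])),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(['I am a sentence', 'an example'], [1, 2])

# Dump the fitted tfidf transformer to a hypothetical temporary path.
dump_path = os.path.join(tempfile.mkdtemp(), 'tfidf.joblib')
joblib.dump(pipeline.get_params()['features__tfidf'], dump_path)

# Insert the reloaded transformer by its nested parameter name.
loaded_tfidf = joblib.load(dump_path)
pipeline.set_params(features__tfidf=loaded_tfidf)

replaced = pipeline.get_params()['features__tfidf'] is loaded_tfidf
y_pred = pipeline.predict(['another sentence'])
print(replaced, y_pred)
```

The double underscore in features__tfidf follows the same naming scheme that get_params uses, so the name read from get_params can be passed straight back to set_params.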