I am using Pipeline from sklearn to classify text.
In this example Pipeline, the steps are a FeatureUnion that wraps a TfidfVectorizer and some custom features, followed by a classifier. I then fit the training data and do the prediction:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)

# CustomFeatures is my own transformer class (definition not shown)
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures()),
    ])),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.
Here I need to pickle the TfidfVectorizer step and leave the custom_features step unpickled, since I am still experimenting with those features. The idea is to make the pipeline faster by pickling the tfidf step.
I know I can pickle the whole Pipeline with joblib.dump, but how do I pickle individual steps?
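
For reference, the whole-pipeline approach I mean looks roughly like the sketch below. It is a simplified version of the pipeline above: I left out the FeatureUnion and custom_features so the snippet runs on its own, and the file name and temporary directory are just placeholders.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# Simplified pipeline: tfidf + classifier only (custom features omitted)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(X, Y)

# Dump the whole fitted pipeline to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), 'pipeline.joblib')  # placeholder path
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline should give the same predictions as the original
same = list(restored.predict(X_dev)) == list(pipeline.predict(X_dev))
```

This saves and restores everything at once, which is exactly what I want to avoid for the custom_features step.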