I am using scikit-learn to build a pipeline. Once the pipeline is built, I use GridSearchCV to find the optimal model. I am working with text data, so I am experimenting with different stemmers. I have created a class called Preprocessor that takes a stemmer class and a vectorizer class, then attempts to override the vectorizer's build_analyzer method to incorporate the given stemmer. However, I see that GridSearchCV's set_params just sets instance attributes directly, i.e. it will not re-instantiate the vectorizer with a new analyzer, which is how I have been doing it:
    import nltk


    class Preprocessor(object):
        # Hard-code the stopwords for now.
        stopwords = nltk.corpus.stopwords.words()

        def __init__(self, stemmer_cls, vectorizer_cls):
            self.stemmer = stemmer_cls()
            analyzer = self._build_analyzer(self.stemmer, vectorizer_cls)
            # A callable analyzer replaces the built-in pipeline, so stop words
            # are handled inside _build_analyzer rather than passed here.
            self.vectorizer = vectorizer_cls(analyzer=analyzer,
                                             decode_error='ignore')

        def _build_analyzer(self, stemmer, vectorizer_cls):
            # The default analyzer tokenizes, lowercases and drops stop words.
            # Preprocessor is not a vectorizer subclass, so super() cannot be
            # used; build the default analyzer from a plain instance instead.
            analyzer = vectorizer_cls(stop_words=self.stopwords,
                                      decode_error='ignore').build_analyzer()
            return lambda doc: (stemmer.stem(w) for w in analyzer(doc))

        def fit(self, X, y=None):
            self.vectorizer.fit(X)
            return self

        def transform(self, X):
            return self.vectorizer.transform(X)

        def fit_transform(self, X, y=None):
            return self.vectorizer.fit_transform(X)
So the question is: how can I override build_analyzer for any vectorizer class that is passed in?