Errors encountered in partial_fit in scikit learn

2019-04-06 11:25发布

站内文章 / Python

28 0

做自己的国王

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

On training with a partial_fit function in scikit learn I get the following error without the program terminating , how is that possible and what are the repurcussions of this even though the trained model behaves correctly and gives correct output. Is this something to worry about?

/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py:207: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = (np.log(self.class_count_)

I am using the following modified training function as I have to maintain a constant list of labels\classes as the partial_fit does not allow adding new classes\labels on subsequent runs , the class prior is same in each batch of training data:

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets,classes=None, partial=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)



        if partial:
            classes=self._encoder.fit_transform(list(set(classes)))
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            self._clf.fit(X, y)

        return self

Also on the second call to partial_fit it throws following error for class count=2000 , and training samples are 3592 on calling model = self.train(featureset, classes=labels,partial=partial):

self._clf.partial_fit(X, y, classes=list(set(classes)))
  File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 277, in partial_fit
    self._count(X, Y)
  File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 443, in _count
    self.feature_count_ += safe_sparse_dot(Y.T, X)
ValueError: operands could not be broadcast together with shapes (2000,11430) (2000,10728) (2000,11430)

Where am I going wrong based on the error thrown? Does it mean that I am pushing in incorrect dimensioned data ? I tried following , I am now calling :

        X = self._vectorizer.transform(X)
        y = self._encoder.transform(y)

each time the partial fit is called. Earlier I used fittransform for each partialfit call. Is this correct

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
             X = self._vectorizer.fit_transform(X)
             y = self._encoder.fit_transform(y)
             self._clf.fit(X, y)

        return self._clf

After many tries I was able to get the following code working, by accounting for first call but I had assumed that the classifier pickled files would be increasing in size after each iteration but I am getting the same sized pkl file for each batch which is not possible:

 class MySklearnClassifier(SklearnClassifier):

    def train(self, labeled_featuresets, classes=None, partial=False,firstcall=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:

           if firstcall:
                classes = self._encoder.fit_transform(np.unique(classes))
                X = self._vectorizer.fit_transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y, classes=classes)
           else:

                X = self._vectorizer.transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y)
        else:
             X = self._vectorizer.fit_transform(X)
             y = self._encoder.fit_transform(y)
             self._clf.fit(X, y)

        return self

here is the entire code:

class postagger(ClassifierBasedTagger):
    """
    A classifier based postagger.
    """
    #MySklearnClassifier()
    def __init__(self, feature_detector=None, train=None,estimator=None,

                 classifierinstance=None, backoff=None,
                 cutoff_prob=None, verbose=True):

        if backoff is None:
            self._taggers = [self]
        else:
            self._taggers = [self] + backoff._taggers
        if estimator:
            classifier = MySklearnClassifier(estimator=estimator)
            #MySklearnClassifier.__init__(self, classifier)
        elif classifierinstance:
            classifier=classifierinstance






        if feature_detector is not None:
            self._feature_detector = feature_detector
            # The feature detector function, used to generate a featureset
            # or each token: feature_detector(tokens, index, history) -> featureset

        self._cutoff_prob = cutoff_prob
        """Cutoff probability for tagging -- if the probability of the
           most likely tag is less than this, then use backoff."""

        self._classifier = classifier
        """The classifier used to choose a tag for each token."""

        # if train and picklename:
        #     self._train(classifier_builder, picklename,tagged_corpus=train, ONLYERRORS=False,verbose=True,onlyfeatures=True ,LOADCONSTRUCTED=None)

    def legacy_getfeatures(self, tagged_corpus=None, ONLYERRORS=False, existingfeaturesetfile=None, verbose=True,
                           labels=artlabels):

        featureset = []
        labels=artlabels
        if not existingfeaturesetfile and tagged_corpus:
            if ONLYERRORS:

                classifier_corpus = open(tagged_corpus + '-ONLYERRORS.richfeature', 'w')
            else:
                classifier_corpus = open(tagged_corpus + '.richfeature', 'w')

            if verbose:
                print('Constructing featureset  for training corpus for classifier.')
            nlp = English()
            #df=pandas.DataFrame()
            store = HDFStore('featurestore.h5')



            for sentence in sPickle.s_load(open(tagged_corpus,'r')):
                untagged_words, tags, senindex = zip(*sentence)
                doc = nlp(u' '.join(untagged_words))
                # untagged_sentence, tags , rest = unpack_three(*zip(*sentence))
                for index in range(len(sentence)):
                    if ONLYERRORS:
                        if tags[index] == '<!SAME!>' and random.random() < 0.05:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label']=tags[index]
                            featureset['senindex']=str(senindex[0])
                            featureset['wordindex']=index
                            df=pandas.DataFrame([featureset])
                            store.append('df',df,index=False,min_itemsize = 150)
                            # classifier_corpus.append((featureset, tags[index]))
                        elif tags[index] in labels:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label']=tags[index]
                            featureset['senindex']=str(senindex[0])
                            featureset['wordindex']=index
                            df=pandas.DataFrame([featureset])
                            store.append('df',df,index=False,min_itemsize = 150)


                        # classifier_corpus.append((featureset, tags[index]))
        # else:
        #     for element in sPickle.s_load(open(existingfeaturesetfile, 'w')):
        #         featureset.append( element)

        return tagged_corpus + '.richfeature'

    def _train(self, featuresetdata, classifier_builder=MultinomialNB(), partial=False, batchsize=500):
        """
        Build a new classifier, based on the given training data
        *tagged_corpus*.

        """



        #labels = set(cPickle.load(open(arguments['-k'], 'r')))
        if partial==False:
           print('Training classifier FULLMODE')
           featureset = []
           for element in sPickle.s_load(open(featuresetdata, 'r')):
               featureset.append(element)

           model = self._classifier.train(featureset, classes=artlabels, partial=False,firstcall=True)
           print('Training complete, dumping')
           try:
            joblib.dump(model,  str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.mpkl')
            print "joblib dumped"
           except:
               print "joblib error"
           cPickle.dump(model, open(str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.cmpkl', 'w'))
           print('dumped')
           return
        #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)

        print('Training classifier each batch of {} training points'.format(batchsize))

        for i, batchelement in enumerate(batch(sPickle.s_load(open(featuresetdata, 'r')), batchsize)):
            featureset = []
            for element in batchelement:
                featureset.append(element)



            # model =  super(postagger, self).train (featureset, partial)
            # pdb.set_trace()
            # featureset = [item for sublist in featureset for item in sublist]
            trainsize = len(featureset)
            print("submitting {} training points for training\neg last one:".format(trainsize))
            for d, l in featureset:
                if len(d) != 113:
                    print d
                    assert False

            print featureset[-1]
            # pdb.set_trace()
            try:
                if i==0:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=True)
                else:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=False)

            except:
                type, value, tb = sys.exc_info()
                traceback.print_exc()
                pdb.post_mortem(tb)

            print('Training for batch {} complete, dumping'.format(i))
            cPickle.dump(model, open(
                str(featuresetdata) + '-' + slugify(str(classifier_builder))[
                                            :10] + 'UPDATED batch-{} of {} points.mpkl'.format(
                    i, trainsize), 'w'))
            print('dumped')
        #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)

    def untag(self,tagged_sentence):
        """
        Given a tagged sentence, return an untagged version of that
        sentence.  I.e., return a list containing the first element
        of each tuple in *tagged_sentence*.

            >>> from nltk.tag.util import untag
            >>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
            ['John', 'saw', 'Mary']

        """

        return [w[0] for w in tagged_sentence]

    def evaluate(self, gold):
        """
        Score the accuracy of the tagger against the gold standard.
        Strip the tags from the gold standard text, retag it using
        the tagger, then compute the accuracy score.

        :type gold: list(list(tuple(str, str)))
        :param gold: The list of tagged sentences to score the tagger on.
        :rtype: float
        """
        gold_tokens=[]
        full_gold_tokens=[]

        tagged_sents = self.tag_sents(self.untag(sent) for sent in gold)
        for sentence in gold:#flatten the list

            untagged_sentences, goldtags,type_feature,startpos_feature,sentence_feature,senindex_feature = zip(*sentence)


            gold_tokens.extend(zip(untagged_sentences,goldtags))
            full_gold_tokens.extend(zip( untagged_sentences, goldtags,type_feature,startpos_feature,sentence_feature,senindex_feature))





        test_tokens = sum(tagged_sents, []) #flatten the list
        getmismatch(gold_tokens,test_tokens,full_gold_tokens)
        return accuracy(gold_tokens, test_tokens)

    #
    def new_feature_detector(self, tokens, index):
        return getfeatures(tokens, index)


    def tag_sents(self, sentences):
        """
        Apply ``self.tag()`` to each element of *sentences*.  I.e.:

            return [self.tag(sent) for sent in sentences]
        """
        return [self.tag(sent) for sent in sentences]

    def tag(self, tokens):
        # docs inherited from TaggerI
        tags = []
        for i in range(len(tokens)):
            tags.append(self.tag_one(tokens, i))
        return list(zip(tokens, tags))

    def tag_one(self, tokens, index):
        """
        Determine an appropriate tag for the specified token, and
        return that tag.  If this tagger is unable to determine a tag
        for the specified token, then its backoff tagger is consulted.

        :rtype: str
        :type tokens: list
        :param tokens: The list of words that are being tagged.
        :type index: int
        :param index: The index of the word whose tag should be
            returned.
        :type history: list(str)
        :param history: A list of the tags for all words before *index*.
        """
        tag = None
        for tagger in self._taggers:
            tag = tagger.choose_tag(tokens, index)
            if tag is not None:  break
        return tag

    def choose_tag(self, tokens, index):
        # Use our feature detector to get the featureset.
        featureset = self.new_feature_detector(tokens, index)

        # Use the classifier to pick a tag.  If a cutoff probability
        # was specified, then check that the tag's probability is
        # higher than that cutoff first; otherwise, return None.

        if self._cutoff_prob is None:
            return self._classifier.prob_classify_many([featureset])
            #return self._classifier.classify_many([featureset])


        pdist = self._classifier.prob_classify_many([featureset])
        tag = pdist.max()
        return tag if pdist.prob(tag) >= self._cutoff_prob else None

回答1:

1. The `RuntimeWarning`

You're getting this warning because np.log is called on 0:

In [6]: np.log(0)
/home/anaconda/envs/python34/lib/python3.4/site-packages/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in log
  if __name__ == '__main__':
Out[6]: -inf

That's because in one of your call, some classes are not represented at all (they have a count of 0) and thus np.log is called on 0. You don't need to worry about it.

2. Class priors

I am using the following modified training function as I have to maintain a constant list of labels\classes as the partial_fit does not allow adding new classes\labels on subsequent runs , the class prior is same in each batch of training data

You are right that you need to pass the list of labels/classes from the start if you're using partial_fit.
I'm unsure about the class prior being the same in each batch of training data. That could have several different meanings, it would be nice if you could clarify what you meant here.
In the mean time, the default behavior for classifiers such as MultinomialNB is that they fit priors to the data (basically they compute frequencies). When using partial_fit, they will do this computation incrementally so that you get the same result as if you had used a single fit call.

3. Your error

Also on the second call to partial_fit it throws following error for class count=2000 , and training samples are 3592 on calling model = self.train(featureset, classes=labels,partial=partial)

Here we need more details. I'm confused that X is of shape (n_samples, n_features) and yet in the traceback it appears to be of shape (2000,11430). That means X has 2000 samples.

The error indeed means that the dimensions of your inputs are inconsistent. I would suggest printing X.shape, y.shape after vectorization for each partial_fit call.

Also you should not be calling fit or fit_transform on the vectorizer that transforms X for each partial_fit call: you should fit it once, then just transform X. This is to ensure that you get consistent dimensions for your transformed X.

4. Your 'earlier' solution

Here's the code you told us you were using:

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
             X = self._vectorizer.fit_transform(X)
             y = self._encoder.fit_transform(y)
             self._clf.fit(X, y)

        return self._clf

As far as I can tell there's not much wrong with that, but we really need more context as to how you're using it here.
A nitpick: I feel it'd be clearer if you put the classes variable as a class attribute, since this variable needs to be the same for each partial_fit call.
Here you might be doing something wrong if you pass different values to the constructor classes argument.

More information that could help us help you:

Prints of X.shape, y.shape.
Context: how are you using the code you provided??
What are you using for _vectorizer, _encoder? What classifier are you ultimately working with?

回答2:

The problem here, I believe, can be seen in the error

ValueError: operands could not be broadcast together with shapes (2000,11430) (2000,10728) (2000,11430)

Note the (2000,11430) (2000,10728) (2000,11430), you supply a dataset with a different amount of features, therefor it can not determine the amount of features since they are different in the second set, and this gives the error. It only shows the error and does not crash because the error is most likely catched in a try catch block.

You probably do not want to give a different set of features (attributes) to the partial_fit function. Your algorithm still performs because it fitst on the remaining two chuncks, but the second chunck is most likely ignored.