Updating training documents for gensim Doc2Vec mod

Posted 2019-07-24 18:14

I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set, and by extension, the model.

I take the new documents and preprocess them as usual:

import gensim
import nltk
stoplist = set(nltk.corpus.stopwords.words('english'))
train_corpus = []
for i, document in enumerate(corpus_update['body'].values.tolist()):
    # Tokenize, drop stopwords, and tag each new document with its position in the update batch
    tokens = [word for word in gensim.utils.simple_preprocess(document) if word not in stoplist]
    train_corpus.append(gensim.models.doc2vec.TaggedDocument(tokens, [i]))

I then load the original model, update the vocabulary, and retrain:

#### Original model
## model = gensim.models.doc2vec.Doc2Vec(dm=0, size=300, hs=1, min_count=10, dbow_words= 1, negative=5, workers=cores)

model = gensim.models.doc2vec.Doc2Vec.load('pvdbow_model_6_06_12_17.doc2vec')

model.build_vocab(train_corpus, update=True)

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

I then update the training set Pandas dataframe by appending the new data, and reset the index.

import pandas as pd
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
corpus = pd.concat([corpus, corpus_update]).reset_index(drop=True)

However, when I try to use infer_vector() with the updated model:

inferred_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

the result quality is poor, suggesting that the indices from the model and the training set dataframe no longer match.

When I compare it against the non-updated training set dataframe (again using the updated model) the results are fine - though, obviously I'm missing the new documents.

Is there any way to keep both updated? I want to be able to make frequent updates to the model without a full retrain.

1 answer
劳资没心,怎么记你
#2 · 2019-07-24 19:10

Gensim Doc2Vec doesn't yet have official support for expanding the vocabulary (via build_vocab(..., update=True)), so the model's behavior here is not defined to do anything useful. In fact, I think any existing doc-tags will be completely discarded and replaced with those in the latest corpus. (Additionally, there are outstanding unresolved reports of memory-fault process crashes when trying to use build_vocab(..., update=True) with Doc2Vec, such as this issue.)

Even if that worked, there are a number of murky balancing issues to consider whenever you continue to call train() on a model with texts that differ from the initial training set. In particular, each such training session will nudge the model to be better on the new examples at the expense of what it learned from the original training, possibly making the model worse for some cases or overall.

The most defensible policy with a growing corpus would be to occasionally retrain from scratch with all training examples combined into one corpus. Another outline of a possible process for rolling updates to a model was discussed in my recent post to the gensim discussion list.

A few other comments on your setup:

  • using both hierarchical-softmax (hs=1) and negative sampling (with negative > 0) increases the model size and training time, but may not offer any advantage compared to using just one mode with more iterations (or other tweaks) – so it's rare to have both modes active

  • by not specifying an iter, you're using the default of 5 inherited from Word2Vec, while published Doc2Vec work often uses 10-20 or more iterations

  • many report infer_vector working better with a much-higher value for its optional parameter steps (which has a default of only 5), and/or with smaller values of alpha (which has a default of 0.1)
