I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set, and by extension, the model.
I take the new documents and perform preprocessing as normal:
stoplist = nltk.corpus.stopwords.words('english')
train_corpus = []
for i, document in enumerate(corpus_update['body'].values.tolist()):
    train_corpus.append(gensim.models.doc2vec.TaggedDocument([word for word in gensim.utils.simple_preprocess(document) if word not in stoplist], [i]))
I then load the original model, update the vocabulary, and retrain:
#### Original model
## model = gensim.models.doc2vec.Doc2Vec(dm=0, size=300, hs=1, min_count=10, dbow_words= 1, negative=5, workers=cores)
model = Doc2Vec.load('pvdbow_model_6_06_12_17.doc2vec')
model.build_vocab(train_corpus, update=True)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
I then update the training set Pandas dataframe by appending the new data, and reset the index.
corpus = corpus.append(corpus_update)
corpus = corpus.reset_index(drop=True)
However, when I try to use infer_vector() with the updated model:
inferred_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
the result quality is poor, suggesting that the indices from the model and the training set dataframe no longer match.
When I compare it against the non-updated training set dataframe (again using the updated model) the results are fine - though, obviously I'm missing the new documents.
Is there any way to keep both updated? I want to be able to make frequent updates to the model without a full retrain.
Gensim Doc2Vec doesn't yet have official support for expanding the vocabulary (via build_vocab(..., update=True)), so the model's behavior here is not defined to do anything useful. In fact, I think any existing doc-tags will be completely discarded and replaced with any in the latest corpus. (Additionally, there are outstanding unresolved reports of memory-fault process crashes when trying to use update_vocab() with Doc2Vec, such as this issue.)

Even if that worked, there are a number of murky balancing issues to consider if ever continuing to call train() on a model with texts different than the initial training set. In particular, each such training session will nudge the model to be better on the new examples, but lose value of the original training, possibly making the model worse for some cases or overall.

The most defensible policy with a growing corpus would be to occasionally retrain from scratch with all training examples combined into one corpus. Another outline of a possible process for rolling updates to a model was discussed in my recent post to the gensim discussion list.
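The retrain-from-scratch policy could be sketched roughly as follows. This is only a minimal sketch, assuming gensim 4.x keyword names (vector_size and epochs; older releases used size and iter) and the 'body' column and stoplist from the question. The helper names combine_bodies and retrain_from_scratch are hypothetical. Tagging each document with its position in the combined list keeps the model's doc-tags aligned with the row index of the appended, re-indexed dataframe.

```python
def combine_bodies(original_bodies, update_bodies):
    """Concatenate old and new document texts.

    A document's position in the returned list is used as its doc-tag, so it
    matches the row index of corpus.append(corpus_update).reset_index(drop=True).
    """
    return list(original_bodies) + list(update_bodies)


def retrain_from_scratch(bodies, stoplist, **doc2vec_params):
    """Train a fresh Doc2Vec model on every document, tagged by list position."""
    # Imported here so the pure-Python helper above works without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess

    train_corpus = [
        TaggedDocument(
            [word for word in simple_preprocess(text) if word not in stoplist],
            [i],
        )
        for i, text in enumerate(bodies)
    ]
    model = Doc2Vec(**doc2vec_params)
    model.build_vocab(train_corpus)
    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model


# Hypothetical usage with the question's dataframes (parameter values are assumptions):
# bodies = combine_bodies(corpus['body'].tolist(), corpus_update['body'].tolist())
# model = retrain_from_scratch(bodies, stoplist, dm=0, vector_size=300,
#                              min_count=10, dbow_words=1, negative=5, epochs=20)
```

Because every document is re-tagged in one pass, there is no risk of the update batch reusing tags 0, 1, 2, ... that already belong to older documents.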
A few other comments on your setup:
- Using both hierarchical softmax (hs=1) and negative sampling (with negative > 0) increases the model size and training time, but may not offer any advantage compared to using just one mode with more iterations (or other tweaks), so it's rare to have both modes active.
- By not specifying an iter, you're using the default inherited from Word2Vec of 5, while published Doc2Vec work often uses 10-20 or more iterations.
- Many report infer_vector working better with a much higher value for its optional parameter steps (which has a default of only 5), and/or with smaller values of alpha (which has a default of 0.1).
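The three setup comments above might translate into code along these lines. This is a minimal sketch, assuming gensim 4.x keyword names (vector_size and epochs for the model, epochs for infer_vector; the releases discussed above used size, iter, and steps instead). The helper names and the specific values (20 training epochs, 50 inference passes, a starting alpha of 0.025) are illustrative assumptions, not tuned recommendations.

```python
def doc2vec_params(use_negative_sampling=True, epochs=20):
    """Build Doc2Vec keyword arguments with exactly one output mode active."""
    params = {
        "dm": 0,             # PV-DBOW, as in the question
        "dbow_words": 1,
        "vector_size": 300,
        "min_count": 10,
        "epochs": epochs,    # explicit, instead of relying on the default
    }
    if use_negative_sampling:
        params.update(hs=0, negative=5)  # negative sampling only
    else:
        params.update(hs=1, negative=0)  # hierarchical softmax only
    return params


def infer_with_more_passes(model, tokens, epochs=50, alpha=0.025):
    """Infer a vector with an explicit number of passes and starting alpha."""
    return model.infer_vector(tokens, epochs=epochs, alpha=alpha)


# Hypothetical usage (requires: from gensim.models.doc2vec import Doc2Vec):
# model = Doc2Vec(**doc2vec_params())
# vector = infer_with_more_passes(model, tokens)
```

Keeping the settings in one function makes it easy to compare the two output modes on the same corpus before committing to one.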