I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code that uses Gensim. As I wanted, I get a trained model and document vectors for the two documents.
However, I would like to know the benefits of retraining the model over several epochs, and how to do that in Gensim. Can we do it using the iter or alpha parameters, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs.
Also, I am interested in knowing whether multiple training iterations are needed for the word2vec model as well.
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple
# Load data
doc1 = ["This is a sentence", "This is another sentence"]
# Transform data
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))
# Train model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4)
# Get the vectors
model.docvecs[0]
model.docvecs[1]
Word2Vec and related algorithms (like 'Paragraph Vectors', aka Doc2Vec) usually make multiple training passes over the text corpus.
Gensim's Word2Vec/Doc2Vec allows the number of passes to be specified with the iter parameter, if you're also supplying the corpus in the object initialization to trigger immediate training. (Your code above does this by supplying docs to the Doc2Vec(docs, ...) constructor call.)
If unspecified, the default iter value used by gensim is 5, to match the default used by Google's original word2vec.c release. So your code above is already using 5 training passes.
Published Doc2Vec work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your Doc2Vec initialization to:
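model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4,
                        iter=20)  # iter=20 requests 20 training passes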
Because Doc2Vec often uses unique identifier tags for each document, more iterations can be more important, so that every doc-vector comes up for training multiple times over the course of the training, as the model gradually improves. On the other hand, because the words in a Word2Vec corpus might appear anywhere throughout it, each word's associated vectors will get multiple adjustments, early and middle and late in the process, as the model improves – even with just a single pass. (So with a giant, varied Word2Vec corpus, it's thinkable to use fewer than the default number of passes.)
You don't need to do your own loop, and most users shouldn't. If you do manage the separate build_vocab() and train() steps yourself, instead of the easier step of supplying the docs corpus in the initializer call to trigger immediate training, then you must supply an epochs argument to train() – and it will perform that number of passes, so you still only need one call to train(), as sketched below.
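If you do go the explicit route, a minimal sketch of that two-step pattern might look like this (using the same pre-4.0 Gensim API as the code above, where the dimensionality parameter is still named size):
model = doc2vec.Doc2Vec(size=100, window=300, min_count=1, workers=4)  # no corpus yet, so no training happens
model.build_vocab(docs)  # one pass over the corpus to collect the vocabulary
model.train(docs, total_examples=model.corpus_count, epochs=20)  # 20 passes in a single train() call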