I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code that uses Gensim. As I wanted, I get a trained model and document vectors for the two documents.
However, I would like to know the benefits of retraining the model over several epochs, and how to do that in Gensim. Can we do it using the iter or alpha parameters, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs.
Also, I am interested in knowing whether multiple training iterations are needed for the word2vec model as well.
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple
# Load data
doc1 = ["This is a sentence", "This is another sentence"]
# Transform data
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))
# Train model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4)
# Get the vectors
model.docvecs[0]
model.docvecs[1]
Word2Vec and related algorithms (like 'Paragraph Vectors', aka Doc2Vec) usually make multiple training passes over the text corpus.
Gensim's Word2Vec/Doc2Vec allows the number of passes to be specified with the iter parameter, if you're also supplying the corpus in the object initialization to trigger immediate training. (Your code above does this by supplying docs to the Doc2Vec(docs, ...) constructor call.)
If unspecified, the default iter value used by gensim is 5, to match the default used by Google's original word2vec.c release. So your code above is already using 5 training passes.
Published Doc2Vec work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your Doc2Vec initialization to:
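model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4,
                        iter=20)  # iter=20 requests 20 training passes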
Because Doc2Vec often uses unique identifier tags for each document, more iterations can be more important, so that every doc-vector comes up for training multiple times over the course of the training, as the model gradually improves. On the other hand, because the words in a Word2Vec corpus might appear anywhere throughout it, each word's associated vectors will get multiple adjustments, early and middle and late in the process, as the model improves – even with just a single pass. (So with a giant, varied Word2Vec corpus, it's thinkable to use fewer than the default number of passes.)
You don't need to do your own loop, and most users shouldn't. If you do manage the separate build_vocab() and train() steps yourself, instead of the easier step of supplying the docs corpus in the initializer call to trigger immediate training, then you must supply an epochs argument to train() – and it will perform that number of passes, so you still only need one call to train(), as sketched below.
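If you do go the explicit route, a minimal sketch of that two-step pattern might look like this (using the same pre-4.0 Gensim API as the code above, where the dimensionality parameter is still named size):
model = doc2vec.Doc2Vec(size=100, window=300, min_count=1, workers=4)  # no corpus yet, so no training happens
model.build_vocab(docs)  # one pass over the corpus to collect the vocabulary
model.train(docs, total_examples=model.corpus_count, epochs=20)  # 20 passes in a single train() call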