Gensim doc2vec file stream training worse performance

Posted 2019-08-21 02:53

Question:

Recently I switched to gensim 3.6, mainly for the optimized training process, which streams the training data directly from a file and thus avoids the GIL performance penalties.

This is how I used to train my doc2vec:

from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec

training_iterations = 20
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)

# Decay the learning rate manually across 20 separate train() calls
for epoch in range(training_iterations):
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
    d2v.alpha -= 0.0002
    d2v.min_alpha = d2v.alpha

And it classifies documents quite well; the only drawback is that during training the CPUs are utilized at about 70%.

So the new way:

corpus_fname = "spped.data"
save_as_line_sentence(corpus, corpus_fname)

# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()

# Train models using all cores
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores, dm=0, vector_size=200, epochs=50)

Now all CPUs are utilized at 100%, but the model performs very poorly. According to the documentation, I should not call the train() method myself; I should only set an epoch count instead of looping over iterations, and the alpha and min_alpha values should not be touched.

The configuration of both Doc2Vec models looks the same to me, so is there an issue with my new setup or configuration, or is something wrong with the new version of gensim?

P.S. I am using the same corpus in both cases. I also tried an epoch count of 100, as well as smaller numbers like 5-20, but I had no luck.

EDIT: The first model was doing 20 iterations of 5 epochs each, the second was doing 50 epochs, so having the second model do 100 epochs made it perform even better, since I was no longer managing the alpha by myself.

About the second issue that popped up: when providing a file with one document per line, the doc ids did not always correspond to the lines. I didn't manage to figure out what could be causing this; it seems to work fine for a small corpus. If I find out what I am doing wrong, I will update this answer.
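
In corpus_file mode the documents are tagged by their line number, so one way to sanity-check the alignment (a minimal sketch, reusing the d2v_model and corpus_fname from above) is to re-infer a vector for a given line and see whether that line's own tag comes back as the nearest neighbour:

# Spot-check: line i of the file should correspond to trained vector docvecs[i]
with open(corpus_fname) as f:
    for i, line in enumerate(f):
        if i >= 5:  # check only the first few lines
            break
        inferred = d2v_model.infer_vector(line.split())
        top_tag, top_sim = d2v_model.docvecs.most_similar([inferred], topn=1)[0]
        print(i, top_tag, round(top_sim, 3))  # top_tag should usually equal i

(Inference is not deterministic, so an occasional mismatch is normal; systematic mismatches would indicate a real alignment problem.)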

The final configuration for a corpus of size 4 GB looks like this:

    # A single train() call; alpha decays smoothly from 0.025 to min_alpha over 100 epochs
    d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
    d2v.build_vocab(corpus)
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=100)

Answer 1:

Most users should not be calling train() more than once in their own loop, trying to manage the alpha and iterations themselves. It is too easy to get it wrong.

Specifically, your code that calls train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, stop consulting it, as it is misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)

Even more specifically: your looping code is actually doing 100 passes over the data: 20 outer loops, each of which calls train() with the default d2v.iter of 5 epochs. Your first train() call smoothly decays the effective alpha from 0.025 to 0.00025, a 100x reduction. But your next train() call then uses a fixed alpha of 0.0248 for 5 passes, then 0.0246, and so on, until your last loop does 5 passes at alpha=0.0212 – still about 85% of the starting value. That is, the lowest alpha is reached early in your training, and every later pass runs at a nearly full-strength learning rate.
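
To make the schedule concrete, here is a minimal sketch (plain Python, using only the constants from the question) of the effective alpha each of the 20 loops uses:

alpha, min_alpha, step = 0.025, 0.00025, 0.0002

# Loop 0: train() smoothly decays alpha from 0.025 down to min_alpha=0.00025
print("loop 0: alpha decays from", alpha, "to", min_alpha)

# Every later loop sets min_alpha equal to alpha, so its 5 passes run at a fixed rate
for loop in range(1, 20):
    alpha -= step
    print("loop %d: 5 passes at fixed alpha=%.4f" % (loop, alpha))
# The final loop runs its 5 passes at alpha=0.0212, about 85% of the start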

To compare the two options fairly, call them with exactly the same parameters, differing only in how the corpus is supplied: a corpus_file path instead of an iterable corpus.
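
For example, a minimal sketch of the two equivalent invocations (reusing corpus, corpus_fname, and the parameter values from the question; epochs=100 matches the roughly 100 total passes of the original loop):

from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec

# Iterable corpus: passing documents to the constructor trains in one step
d2v_iterable = Doc2Vec(documents=corpus, dm=0, vector_size=200, epochs=100, workers=cpu_count())

# File-based corpus: identical parameters, only the corpus source differs
d2v_file = Doc2Vec(corpus_file=corpus_fname, dm=0, vector_size=200, epochs=100, workers=cpu_count())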

You should get similar results from both corpus forms. (If you had a reproducible test case where the same corpus gets very different-quality results, and there wasn't some other error, that could be worth reporting to gensim as a bug.)

If the results for both aren't as good as when you were (wrongly) managing train() and alpha yourself, it is likely because you aren't doing a comparable amount of total training: your loop did 20 × 5 = 100 passes, while the corpus_file run did only 50.