I'm running a Lubuntu 16.04 machine with gcc installed, but I can't get gensim to work with Cython: when I train a Doc2Vec model, it is only ever trained with one worker, which is dreadfully slow.

As I said, gcc was installed from the start. I may have made the mistake of installing gensim before Cython. I corrected that by forcing a reinstall of gensim via pip, but with no effect: still just one worker.

The machine is set up as a Spark master and I interface with Spark via pyspark. It works something like this: pyspark uses Jupyter, and Jupyter uses Python 3.5; this way I get a Jupyter interface to my cluster. I have no idea whether this setup is the reason why I can't get gensim to work with Cython. I don't execute any gensim code on the cluster; it is just more convenient to fire up Jupyter for the gensim work as well.
After digging deeper and trying things like loading the whole corpus into memory and running gensim in a different environment, all with no effect, it seems the problem is that the gensim code is only partially parallelized. As a result, the workers cannot fully utilize the CPU. See the issues on GitHub: link.
You probably did this already, but could you please check that you are using the parallel Cythonised version with

assert gensim.models.doc2vec.FAST_VERSION > -1

The gensim Doc2Vec code is parallelized, but unfortunately the I/O code that feeds it, which lives outside of gensim, isn't. For example, in the GitHub issue you linked, parallelization is indeed achieved after the corpus is loaded into RAM by
doclist = [doc for doc in documents]