How to get Cython and gensim to work with PySpark

Published 2019-09-16 11:46

Question:

I'm running a Lubuntu 16.04 machine with gcc installed. I can't get gensim to work with Cython: when I train a Doc2Vec model, it only ever trains with one worker, which is dreadfully slow.

As I said, gcc was installed from the start. I may then have made the mistake of installing gensim before Cython. I tried to correct that by forcing a reinstall of gensim via pip, with no effect: still just one worker.
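For reference, the reinstall was done roughly like this (a sketch; the exact pip flags may vary by environment):

    # Install Cython first, then rebuild gensim so its optimized
    # extensions can compile against the existing gcc toolchain.
    pip install cython
    pip install --no-cache-dir --force-reinstall gensim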

The machine is set up as a Spark master and I interface with Spark via PySpark. It works something like this: PySpark uses Jupyter, and Jupyter uses Python 3.5, which gives me a Jupyter interface to my cluster. I have no idea whether this is the reason I can't get gensim to work with Cython. I don't execute any gensim code on the cluster; it is just more convenient to fire up Jupyter to also do gensim.
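For completeness, PySpark is pointed at Jupyter in roughly the standard way, via environment variables like these (a sketch, not necessarily the exact configuration on this machine):

    # Make PySpark launch a Jupyter notebook as its driver process
    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS=notebook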

Answer 1:

After digging deeper and trying things like loading the whole corpus into memory and executing gensim in a different environment, all with no effect, it seems this is a problem with gensim itself: the code is only partially parallelized. This results in the workers not being able to fully utilize the CPU. See the related issues on GitHub.

Answer 2:

You probably did this already, but could you please check that you are using the parallel Cythonised version by asserting gensim.models.doc2vec.FAST_VERSION > -1?
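In a notebook cell that check would look something like this (gensim 3.x; FAST_VERSION is -1 when the slow pure-Python fallback is in use):

    import gensim.models.doc2vec

    # FAST_VERSION > -1 means the optimized Cython routines were loaded;
    # -1 means gensim fell back to the pure-Python implementation.
    assert gensim.models.doc2vec.FAST_VERSION > -1, "slow pure-Python code path in use"
    print("FAST_VERSION:", gensim.models.doc2vec.FAST_VERSION)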

The gensim Doc2Vec code is parallelized, but unfortunately the I/O code outside of gensim isn't. For example, in the GitHub issue you linked, parallelization is indeed achieved after the corpus is loaded into RAM by doclist = [doc for doc in documents].
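Put together, that approach might look like the following sketch (the toy corpus and hyperparameters are placeholders, not values from your setup):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus for illustration; in practice `documents` would be a
    # streaming iterable reading from disk.
    documents = (TaggedDocument(words=["sample", "text", str(i)], tags=[i])
                 for i in range(1000))

    # Materialize the whole corpus in RAM so slow I/O can't starve the workers.
    doclist = [doc for doc in documents]

    model = Doc2Vec(vector_size=50, min_count=1, workers=4)  # illustrative settings
    model.build_vocab(doclist)
    model.train(doclist, total_examples=model.corpus_count, epochs=model.epochs)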