How to get Cython and gensim to work with PySpark

Posted 2019-09-16 11:27

I'm running a Lubuntu 16.04 machine with gcc installed. I can't get gensim to work with Cython: when I train a Doc2Vec model, it only ever trains with one worker, which is dreadfully slow.

As I said, gcc was installed from the start. I may have made the mistake of installing gensim before Cython. I tried to correct that by forcing a reinstall of gensim via pip, but with no effect: still just one worker.

The machine is set up as a Spark master and I interface with Spark via PySpark. It works something like this: PySpark uses Jupyter, and Jupyter uses Python 3.5, which gives me a Jupyter interface to my cluster. I have no idea whether this is the reason I can't get gensim to work with Cython. I don't execute any gensim code on the cluster; it is just more convenient to do the gensim work from Jupyter as well.
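Roughly, the training call looks like this (the corpus and parameter values here are just illustrative, not my actual code):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-in corpus; the real corpus is much larger.
documents = [
    TaggedDocument(words=["some", "example", "tokens"], tags=[0]),
    TaggedDocument(words=["more", "example", "tokens"], tags=[1]),
]

# workers=4 should use four training threads, but in practice
# training behaves as if only one worker were active.
model = Doc2Vec(documents, vector_size=100, epochs=10, workers=4)
```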

2 answers
等我变得足够好
Answer 2 · 2019-09-16 11:54

After digging deeper and trying things like loading the whole corpus into memory, executing gensim in a different environment, etc., all with no effect, it seems to be a problem with gensim itself: the code is only partially parallelized, so the workers cannot fully utilize the CPU. See the issues on the GitHub link.

聊天终结者
Answer 3 · 2019-09-16 12:06

You probably did this already, but could you please check that you are using the parallel Cythonized version by running assert gensim.models.doc2vec.FAST_VERSION > -1?
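That is, something like this (a minimal check; the assertion message is just a suggestion):

```python
import gensim.models.doc2vec

# In the gensim 3.x series, FAST_VERSION is -1 when the Cython-optimized
# routines could not be compiled and training falls back to the slow,
# effectively single-threaded pure-Python path.
assert gensim.models.doc2vec.FAST_VERSION > -1, \
    "Cython extensions not in use; reinstall gensim with a C compiler available"
```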

The gensim Doc2Vec code is parallelized, but unfortunately the I/O code outside of gensim isn't. For example, in the GitHub issue you linked, parallelization is indeed achieved after the corpus is loaded into RAM by doclist = [doc for doc in documents].
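A minimal sketch of that pattern, assuming a streaming corpus such as TaggedLineDocument over a hypothetical corpus.txt:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# Hypothetical streaming corpus; corpus.txt is a placeholder path.
documents = TaggedLineDocument("corpus.txt")

# Materialize the corpus in RAM so single-threaded I/O
# doesn't starve the training workers.
doclist = [doc for doc in documents]

model = Doc2Vec(doclist, vector_size=100, epochs=10, workers=4)
```

Whether this helps depends on whether I/O or training is the bottleneck for your particular corpus.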
