How to get Cython and gensim to work with PySpark

Posted 2019-09-16 11:27

I'm running a Lubuntu 16.04 machine with gcc installed. I can't get gensim to work with Cython: when I train a Doc2Vec model, it only ever trains with one worker, which is dreadfully slow.

As I said, gcc was installed from the start. I may have made the mistake of installing gensim before Cython. I tried to correct that by forcing a reinstall of gensim via pip, but with no effect: still just one worker.

The machine is set up as a Spark master and I interface with Spark via PySpark. It works something like this: PySpark uses Jupyter, and Jupyter uses Python 3.5, which gives me a Jupyter interface to my cluster. I have no idea whether this is the reason I can't get gensim to work with Cython. I don't execute any gensim code on the cluster; it is just more convenient to do the gensim work from Jupyter as well.
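Roughly, the training call looks like this (the corpus and parameter values here are just illustrative, not my actual code):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-in corpus; the real corpus is much larger.
documents = [
    TaggedDocument(words=["some", "example", "tokens"], tags=[0]),
    TaggedDocument(words=["more", "example", "tokens"], tags=[1]),
]

# workers=4 should use four training threads, but in practice
# training behaves as if only one worker were active.
model = Doc2Vec(documents, vector_size=100, epochs=10, workers=4)
```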

2 answers
等我变得足够好
Answer 2 · 2019-09-16 11:54

After digging deeper and trying things like loading the whole corpus into memory, executing gensim in a different environment, etc., all with no effect, it seems to be a problem with gensim itself: the code is only partially parallelized, so the workers cannot fully utilize the CPU. See the issues on the GitHub link.

聊天终结者
Answer 3 · 2019-09-16 12:06

You probably did this already, but could you please check that you are using the parallel Cythonized version by running assert gensim.models.doc2vec.FAST_VERSION > -1?
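That is, something like this (a minimal check; the assertion message is just a suggestion):

```python
import gensim.models.doc2vec

# In the gensim 3.x series, FAST_VERSION is -1 when the Cython-optimized
# routines could not be compiled and training falls back to the slow,
# effectively single-threaded pure-Python path.
assert gensim.models.doc2vec.FAST_VERSION > -1, \
    "Cython extensions not in use; reinstall gensim with a C compiler available"
```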

The gensim Doc2Vec code is parallelized, but unfortunately the I/O code outside of gensim isn't. For example, in the GitHub issue you linked, parallelization is indeed achieved after the corpus is loaded into RAM by doclist = [doc for doc in documents].
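A minimal sketch of that pattern, assuming a streaming corpus such as TaggedLineDocument over a hypothetical corpus.txt:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# Hypothetical streaming corpus; corpus.txt is a placeholder path.
documents = TaggedLineDocument("corpus.txt")

# Materialize the corpus in RAM so single-threaded I/O
# doesn't starve the training workers.
doclist = [doc for doc in documents]

model = Doc2Vec(doclist, vector_size=100, epochs=10, workers=4)
```

Whether this helps depends on whether I/O or training is the bottleneck for your particular corpus.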
