I'm using Pool.map for a scoring procedure:
- a "cursor" that yields millions of arrays from a data source
- a calculation on each array
- saving the result to a data sink
The results are independent of each other.
I'm just wondering whether I can avoid the memory demand. At first glance it looks as if every array is pulled into Python before steps 2 and 3 run. In any case I do get a speed improvement.
Data source and sink are both in MongoDB:

    from multiprocessing import Pool

    def scoring(some_arguments):
        ### some stuff and finally persist ###
        collection.update({uid: _uid}, {'$set': res_profile}, upsert=True)

    cursor = tracking.find(timeout=False)
    score_proc_pool = Pool(options.cores)
    # finally I use a wrapper so that only the document is passed to map
    score_proc_pool.map(scoring_wrapper, cursor, chunksize=10000)
Am I doing something wrong, or is there a better way to do this in Python?
The map functions of a Pool internally convert the iterable to a list if it doesn't have a __len__ attribute. The relevant code is in Pool.map_async, as that is used by Pool.map (and starmap) to produce the result, which is also a list.
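You can see this check near the top of the map helper in CPython's multiprocessing/pool.py; paraphrased (the exact helper name differs between Python versions), it amounts to:

    # paraphrased from multiprocessing/pool.py: before any work is dispatched,
    # an iterable without __len__ (e.g. a cursor or generator) is materialized
    if not hasattr(iterable, '__len__'):
        iterable = list(iterable)

So with Pool.map your whole MongoDB cursor is pulled into memory before the first chunk is even sent to a worker.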
If you don't want to read all the data into memory first, use Pool.imap or Pool.imap_unordered instead; they return an iterator that yields the results as they come in.
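Adapted to your snippet, only the call changes, but you do have to consume the returned iterator, and imap takes a chunksize as well (its default of 1 is usually far too small for millions of documents). This is just a sketch under the same assumptions as your code (scoring_wrapper, tracking and options are yours; chunksize=1000 is an arbitrary starting point):

    from multiprocessing import Pool

    cursor = tracking.find(timeout=False)
    score_proc_pool = Pool(options.cores)

    # imap_unordered feeds tasks from the cursor lazily and yields results as
    # workers finish, instead of building one big input list and one big result
    # list the way map does
    for _ in score_proc_pool.imap_unordered(scoring_wrapper, cursor, chunksize=1000):
        pass  # results are persisted inside the worker, so there is nothing to collect

    score_proc_pool.close()
    score_proc_pool.join()

imap_unordered is the cheaper of the two here because you don't care about the order of results; use imap if you ever need them back in cursor order.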