Running Python scripts in parallel

Posted 2020-06-03 07:54

I have a huge dataset of videos that I process with a Python script called process.py. The problem is that processing the whole dataset, which contains 6000 videos, takes a very long time. So I came up with the idea of splitting the dataset into, say, 4 parts, copying the same code into separate scripts (e.g. process1.py, process2.py, process3.py, process4.py), and running each one in a different shell on one portion of the dataset.

My question is: would that actually gain me anything in terms of performance? I have a machine with 10 cores, so it would be very beneficial if I could somehow exploit this multicore structure. I have heard about Python's multiprocessing module, but unfortunately I don't know much about it, and I didn't write my script with its features in mind. Is the idea of starting each script in a different shell nonsense? Is there a way to choose which core each script runs on?

1 Answer
男人必须洒脱
Answered 2020-06-03 08:20

The multiprocessing documentation (https://docs.python.org/2/library/multiprocessing.html) is actually fairly easy to digest. The section on using a pool of workers (https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers) should be particularly relevant.

You definitely do not need multiple copies of the same script. Here is an approach you can adopt.

Assume this is the general structure of your existing script (process.py):

def convert_vid(fname):
    # do the heavy lifting for a single video file
    pass

if __name__ == '__main__':
    # There exist VIDEO_SET_1 to 4, as mentioned in your question
    for fname in VIDEO_SET_1:
        convert_vid(fname)

With multiprocessing, you can run the function convert_vid in separate processes. Here is the general scheme:

from multiprocessing import Pool

def convert_vid(fname):
    # do the heavy lifting for a single video file
    pass

if __name__ == '__main__':
    # Combine the four video sets into one flat list of file names and
    # let a pool of 4 worker processes share the work between them.
    all_videos = VIDEO_SET_1 + VIDEO_SET_2 + VIDEO_SET_3 + VIDEO_SET_4
    pool = Pool(processes=4)
    pool.map(convert_vid, all_videos)
    pool.close()
    pool.join()
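
As a side note, here is a minimal sketch of how the list of files could be collected automatically instead of being split into four sets by hand. The videos/*.mp4 pattern is just a placeholder for wherever your files actually live, and the with-statement form of Pool assumes Python 3:

import glob
from multiprocessing import Pool

def convert_vid(fname):
    # do the heavy lifting for a single video file
    pass

if __name__ == '__main__':
    # Hypothetical location of the 6000 videos; adjust the pattern as needed
    all_videos = glob.glob('videos/*.mp4')

    # The with-statement closes the pool and waits for the workers to finish
    with Pool(processes=4) as pool:
        pool.map(convert_vid, all_videos)

The pool takes care of spreading the files across the worker processes, and the operating system schedules those processes onto your cores, so there is no need to pin each one to a particular core yourself.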