I have a huge dataset of videos that I process using a Python script called process.py. The problem is that processing the whole dataset, which contains 6000 videos, takes a lot of time. So I came up with the idea of dividing the dataset into, for example, 4 parts, copying the same code into different Python scripts (e.g. process1.py, process2.py, process3.py, process4.py), and running each one in a different shell on one portion of the dataset.
My question is: would that bring me anything in terms of performance? I have a machine with 10 cores, so it would be very beneficial if I could somehow exploit this multicore structure. I heard about the multiprocessing module of Python, but unfortunately I don't know much about it, and I didn't write my script with its features in mind. Is the idea of starting each script in a different shell nonsense? Is there a way to choose which core would be used by each script?
The multiprocessing documentation (https://docs.python.org/2/library/multiprocessing.html) is actually fairly easy to digest. This section (https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers) should be particularly relevant. You definitely do not need multiple copies of the same script. This is an approach you can adopt:
Assume this is the general structure of your existing script (process.py):
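As a minimal sketch (the function name convert_vid, the dataset path, and the loop are placeholders standing in for whatever your process.py actually does), something like:

```python
import os

DATASET_DIR = "/path/to/videos"  # placeholder path

def convert_vid(vid_path):
    # whatever per-video processing the script currently does
    ...

def main():
    # process every video in the dataset, one after another
    for fname in os.listdir(DATASET_DIR):
        convert_vid(os.path.join(DATASET_DIR, fname))

if __name__ == "__main__":
    main()
```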
With multiprocessing, you can run the function convert_vid in separate processes. Here is the general scheme:
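A minimal sketch using multiprocessing.Pool, again assuming the hypothetical convert_vid and dataset path from above:

```python
import os
from multiprocessing import Pool

DATASET_DIR = "/path/to/videos"  # placeholder path

def convert_vid(vid_path):
    # per-video processing, unchanged from the serial version
    ...

def main():
    videos = [os.path.join(DATASET_DIR, f) for f in os.listdir(DATASET_DIR)]
    # With 10 cores, 8-10 worker processes is a reasonable starting point.
    pool = Pool(processes=8)
    pool.map(convert_vid, videos)  # distributes the videos across the workers
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
```

Pool.map splits the list of videos across the worker processes for you, and the operating system schedules those processes onto the available cores, so you don't have to (and generally shouldn't) pin each one to a specific core yourself.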