I have a script that walks through a directory and searches all files with a given extension (e.g. .xml) for given strings and replaces them. To do this I use the Python multiprocessing library.
As an example I am running it over 1100 .xml files with around 200 MB of data in total. The complete execution time is 8 minutes on my 15" MacBook Pro (2015).
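For context, check_file (used in the snippet further down) does the search-and-replace on a single file. A simplified sketch of how the file list is built and what the worker does, with a placeholder replacements dict instead of my real patterns, looks roughly like this:

import os

# placeholder mapping - the real search/replace strings live elsewhere
replacements = {"old_string": "new_string"}

def build_file_list(root_dir, extension=".xml"):
    """Collect all files below root_dir that end with the given extension."""
    file_list = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(extension):
                file_list.append(os.path.join(dirpath, name))
    return file_list

def check_file(path):
    """Read one file, apply every replacement, and write the result back."""
    with open(path, "r", encoding="utf-8") as fh:
        content = fh.read()
    for old, new in replacements.items():
        content = content.replace(old, new)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(content)
    return path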
But after a few minutes the processes go to sleep one after another, which I can see in "top" (here after 7 minutes):
top output
PID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPR PGRP PPID STATE BOOSTS %CPU_ME %CPU_OTHRS
1007 Python 0.0 07:03.51 1 0 7 5196K 0B 0B 998 998 sleeping *0[1] 0.00000 0.00000
1006 Python 99.8 07:29.07 1/1 0 7 4840K 0B 0B 998 998 running *0[1] 0.00000 0.00000
1005 Python 0.0 02:10.02 1 0 7 4380K 0B 0B 998 998 sleeping *0[1] 0.00000 0.00000
1004 Python 0.0 04:24.44 1 0 7 4624K 0B 0B 998 998 sleeping *0[1] 0.00000 0.00000
1003 Python 0.0 04:25.34 1 0 7 4572K 0B 0B 998 998 sleeping *0[1] 0.00000 0.00000
1002 Python 0.0 04:53.40 1 0 7 4612K 0B 0B 998 998 sleeping *0[1] 0.00000 0.00000
So now only one process is doing all the work, while the others went to sleep after about 4 minutes.
Code snippet
import multiprocessing

# set pool size to the number of cores in the machine
pool_size = multiprocessing.cpu_count()
# create the pool
pool = multiprocessing.Pool(processes=pool_size)
# hand the pool the worker function and the input data - one call per file in file_list
pool_outputs = pool.map(check_file, file_list)
# no more tasks: close the pool and wait for the workers to finish
pool.close()
pool.join()
So why do all but one of the processes go to sleep?
My guess: the file list is split evenly across all workers in the pool (the same number of files each), and a few workers are just "lucky" to get the small files and therefore finish earlier. Can this be true? I was assuming it works more like a queue, so that every worker gets a new file when it has finished its current one, until the list is empty.
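If that guess is right, one way to force the queue-like behaviour would be to hand out one file per task via the chunksize argument (or to use imap_unordered), so that an idle worker immediately fetches the next file. An untested sketch of what I would try:

# hand out one file per task so an idle worker immediately picks up the next one
pool = multiprocessing.Pool(processes=pool_size)
try:
    pool_outputs = pool.map(check_file, file_list, chunksize=1)
    # alternative: collect results as they finish, in whatever order they complete
    # pool_outputs = list(pool.imap_unordered(check_file, file_list, chunksize=1))
finally:
    pool.close()
    pool.join()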