I have a Python script that I want to use as a controller to another Python script. I have a server with 64 processors, so want to spawn up to 64 child processes of this second Python script. The child script is called:
$ python create_graphs.py --name=NAME
where NAME is something like XYZ, ABC, NYU etc.
In my parent controller script I retrieve the name variable from a list:
my_list = [ 'XYZ', 'ABC', 'NYU' ]
So my question is, what is the best way to spawn off these processes as children? I want to limit the number of children to 64 at a time, so need to track the status (if the child process has finished or not) so I can efficiently keep the whole generation running.
I looked into using the subprocess package, but rejected it because it only spawns one child at a time. I finally found the multiprocessor package, but I admit to being overwhelmed by the whole threads vs. subprocesses documentation.
Right now, my script uses subprocess.call
to only spawn one child at a time and looks like this:
#!/path/to/python
import subprocess, multiprocessing, Queue
from multiprocessing import Process
my_list = [ 'XYZ', 'ABC', 'NYU' ]
if __name__ == '__main__':
processors = multiprocessing.cpu_count()
for i in range(len(my_list)):
if( i < processors ):
cmd = ["python", "/path/to/create_graphs.py", "--name="+ my_list[i]]
child = subprocess.call( cmd, shell=False )
I really want it to spawn up 64 children at a time. In other stackoverflow questions I saw people using Queue, but it seems like that creates a performance hit?
I don't think you need queue unless you intend to get data out of the applications (Which if you do want data, I think it may be easier to add it to a database anyway)
but try this on for size:
put the contents of your create_graphs.py script all into a function called "create_graphs"
I know that this will result in 1 less threads than processors, which is probably good, it leaves a processor to manage the threads, disk i/o, and other things happening on the computer. If you decide you want to use the last core just add one to it
edit: I think I may have misinterpreted the purpose of my_list. You do not need
my_list
to keep track of the threads at all (as they're all referenced by the items in thethreads
list). But this is a fine way of feeding the processes input - or even better: use a generator function ;)The purpose of
my_list
andthreads
my_list
holds the data that you need to process in your functionthreads
is just a list of the currently running threadsthe while loop does two things, start new threads to process the data, and check if any threads are done running.
So as long as you have either (a) more data to process, or (b) threads that aren't finished running.... you want to program to continue running. Once both lists are empty they will evaluate to
False
and the while loop will exitI would definitely use multiprocessing rather than rolling my own solution using subprocess.
Here is the solution I came up, based on Nadia and Jim's comments. I am not sure if it is the best way, but it works. The original child script being called needs to be a shell script because I need to use some 3rd party apps including Matlab. So I had to take it out of Python and code it in bash.
Does this seem like a reasonable solution? I tried to use Jim's while loop format, but my script just returned nothing. I am not sure why that would be. Here is the output when I run the script with Jim's 'while' loop replacing the 'for' loop:
When I run it with the 'for' loop, I get something more meaningful:
So this works, and I am happy. However, I still don't get why I can't use Jim's 'while' style loop instead of the 'for' loop I am using. Thanks for all the help - I am impressed with the breadth of knowledge @ stackoverflow.
What you are looking for is the process pool class in multiprocessing.
And here is a calculation example to make it easier to understand. The following will divide 10000 tasks on N processes where N is the cpu count. Note that I'm passing None as the number of processes. This will cause the Pool class to use cpu_count for the number of processes (reference)