I'm currently launching a subprocess and parsing its stdout on the fly, without waiting for the process to finish:
    import subprocess

    for sample in all_samples:
        my_tool_subprocess = subprocess.Popen('mytool {}'.format(sample), shell=True, stdout=subprocess.PIPE)
        line = True
        while line:
            # readline() returns an empty bytes object at EOF, which ends the loop
            line = my_tool_subprocess.stdout.readline()
            # here I parse stdout..
In my script I perform this action multiple times, once per input sample.
The main problem is that every subprocess is a program/tool that uses one CPU at 100% while it is running, and it takes some time, maybe 20-40 minutes per input.
What I would like to achieve is a pool or queue (I'm not sure of the exact terminology) with at most N subprocess jobs running at the same time, so that I can maximize performance instead of proceeding sequentially.
So the execution flow for a pool of, for example, at most 4 jobs should be:
- Launch 4 subprocesses.
- When one of the jobs finishes, parse its stdout and launch the next one.
- Do this until all the jobs in the queue are finished.
Even if I can achieve this, I don't know how I could identify which sample's subprocess is the one that has finished. At the moment I don't need to identify them, since the subprocesses run sequentially and I parse stdout as each one prints it.
This is really important, since I need to identify the output of each subprocess and assign it to its corresponding input/sample.
As I understand your question, the result of the first process, once it has finished, is supplied to the second process, then to the third, and so on. To achieve this, import the threading module and use the Thread class, calling start() and then join() on each thread so that the next one does not start until the previous one has finished.
If instead the jobs should run at the same time, write the same code without the proc.join() call after each start().
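A minimal sketch of that variant, assuming Python 3.7+. Since 'mytool' is not available here, `build_command` is a hypothetical stand-in that launches a child Python process echoing the sample name; `run_sample`, `run_all` and the `results` dict are illustrative names, not part of the original code. Each thread stores its output under the sample name, which is one way to tell afterwards which output belongs to which input.

```python
import subprocess
import sys
import threading


def build_command(sample):
    # Hypothetical stand-in for 'mytool <sample>': a child Python process
    # that only echoes the sample name, so the sketch runs anywhere.
    return [sys.executable, "-c",
            "import sys; print('processed', sys.argv[1])", sample]


def run_sample(sample, results):
    # Each thread launches one subprocess and reads its stdout line by line.
    proc = subprocess.Popen(build_command(sample),
                            stdout=subprocess.PIPE, text=True)
    lines = [line.rstrip("\n") for line in proc.stdout]
    proc.wait()
    # Key the output by sample so it can be matched to its input later.
    results[sample] = lines


def run_all(samples):
    results = {}
    threads = [threading.Thread(target=run_sample, args=(s, results))
               for s in samples]
    for t in threads:
        t.start()  # no join() here, so all threads run concurrently
    for t in threads:
        t.join()   # wait only after every thread has been started
    return results


results = run_all(["sample_a", "sample_b", "sample_c", "sample_d"])
```

Note that this starts one thread, and therefore one subprocess, per sample at once; it does not cap the number of concurrent jobs at N.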
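In this case the main thread starts the other four threads right away. This is multithreading within a single process, so CPU-bound Python code gains nothing from a multicore processor (in CPython, because of the global interpreter lock). To benefit from multiple cores for such code, use the multiprocessing module: each worker is then a separate process, and separate processes can run completely independently of one another. Note that a thread that only waits on an external program started with subprocess is not limited in this way, since the external program already runs in its own process.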
ThreadPool (from multiprocessing.pool) could be a good fit for your problem: you set the number of worker threads and add jobs, and the threads work their way through all the tasks.
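A rough sketch of how that could look, assuming Python 3.7+ and `multiprocessing.pool.ThreadPool`. As 'mytool' is not available here, `build_command` is a hypothetical stand-in that starts a child Python process echoing the sample name; `run_sample`, `run_pool` and `max_jobs` are illustrative names as well. Each worker returns the sample together with its parsed output, so the result can always be matched to its input, whatever order the jobs finish in.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def build_command(sample):
    # Hypothetical stand-in for 'mytool <sample>': a child Python process
    # that only echoes the sample name, so the sketch runs anywhere.
    return [sys.executable, "-c",
            "import sys; print('processed', sys.argv[1])", sample]


def run_sample(sample):
    # Runs in a worker thread: launch the tool and parse stdout as it arrives.
    proc = subprocess.Popen(build_command(sample),
                            stdout=subprocess.PIPE, text=True)
    parsed = []
    for line in proc.stdout:
        parsed.append(line.strip())  # placeholder for the real parsing step
    proc.wait()
    # Return the sample with its output so the caller can identify it.
    return sample, proc.returncode, parsed


def run_pool(samples, max_jobs=4):
    outputs = {}
    # At most max_jobs worker threads, hence at most max_jobs subprocesses.
    with ThreadPool(max_jobs) as pool:
        # imap_unordered yields each result as soon as its job finishes;
        # a free worker then picks up the next sample in the queue.
        for sample, returncode, parsed in pool.imap_unordered(run_sample, samples):
            outputs[sample] = (returncode, parsed)
    return outputs


outputs = run_pool(["s1", "s2", "s3", "s4", "s5", "s6"], max_jobs=4)
```

Threads are enough here because the heavy work happens in the external processes; the Python threads mostly wait on their pipes.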