I want to use subprocesses to let 20 instances of a script I wrote run in parallel. Let's say I have a big list of URLs with about 100,000 entries, and my program should make sure that 20 instances of my script are working on that list at all times. I wanted to code it as follows:
urllist = [url1, url2, url3, .. , url100000]
i = 0
while number_of_subprocesses < 20 and i < 100000:
    subprocess.Popen(['python', 'script.py', urllist[i]])
    i = i + 1
My script just writes something into a database or text file. It doesn't output anything and doesn't need any input other than the URL.
My problem is that I wasn't able to find out how to get the number of subprocesses that are active. I'm a novice programmer, so every hint and suggestion is welcome. I was also wondering how I can manage it so that, once the 20 subprocesses are loaded, the while loop checks the conditions again. I thought of maybe putting another while loop over it, something like
while i < 100000:
    while number_of_subprocesses < 20:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1
    if number_of_subprocesses == 20:
        sleep()  # wait some time before checking again
Or maybe there's a better possibility, so that the while loop is always checking the number of subprocesses?
I also considered using the multiprocessing module, but I found it really convenient to just call script.py with subprocess instead of a function with multiprocessing.
Maybe someone can help me and point me in the right direction. Thanks a lot!
Taking a different approach from the callback idea below, as it seems that the callback can't be sent to the subprocess as a parameter:
import subprocess
import time

# urllist is assumed to be the list of URLs from the question
NextURLNo = 0
MaxProcesses = 20
MaxUrls = 100000  # Note: this would be better as len(urllist)
Processes = []

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global Processes
    if NextURLNo < MaxUrls:
        proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
        print("Started to Process %s" % urllist[NextURLNo])
        NextURLNo += 1
        Processes.append(proc)

def CheckRunning():
    """ Check any running processes and start new ones if there are spare slots. """
    global Processes
    global NextURLNo
    for p in range(len(Processes) - 1, -1, -1):  # Check the processes in reverse order
        if Processes[p].poll() is not None:  # poll() returns None while the process is still running
            del Processes[p]  # Remove from list - this is why we needed reverse order
    while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls):  # More to do and some spare slots
        StartNew()

if __name__ == "__main__":
    CheckRunning()  # This will start the max processes running
    while (len(Processes) > 0):  # Something is still going on
        time.sleep(0.1)  # You may wish to change this delay
        CheckRunning()
    print("Done!")
Just keep count as you start them and use a callback from each subprocess to start a new one if there are any URL list entries left to process.
e.g., assuming that your subprocess calls the OnExit method passed to it as it ends:
NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0
MaxUrls = 100000

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global NoSubProcess
    if NextURLNo < MaxUrls:
        # Note: this assumes the child can somehow receive and call OnExit,
        # which is the assumption the poll()-based approach above avoids
        subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
        print("Started to Process %s" % urllist[NextURLNo])
        NextURLNo += 1
        NoSubProcess += 1

def OnExit():
    global NoSubProcess
    NoSubProcess -= 1

if __name__ == "__main__":
    for n in range(MaxProcesses):
        StartNew()
    while (NoSubProcess > 0):
        time.sleep(1)
        if (NextURLNo < MaxUrls):
            for n in range(NoSubProcess, MaxProcesses):
                StartNew()
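If you actually want a callback that fires in the parent, one option (a sketch under my own assumptions, not part of the answer above) is to skip passing OnExit to the child entirely and instead have the parent wait on each Popen object from a small watcher thread:

import subprocess
import threading

def StartWithWatcher(url, on_exit):
    """ Hypothetical helper: launch script.py and call on_exit() in the parent
        once the subprocess finishes. """
    proc = subprocess.Popen(['python', 'script.py', url])

    def watcher():
        proc.wait()   # block this background thread until the child exits
        on_exit()     # now run the callback in the parent process

    threading.Thread(target=watcher, daemon=True).start()
    return proc

If OnExit then decrements NoSubProcess from several watcher threads, guarding the counter with a threading.Lock is a sensible precaution.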
To keep a constant number of concurrent requests, you could use a thread pool:
#!/usr/bin/env python
from multiprocessing.dummy import Pool
def process_url(url):
    # ... handle a single url
    pass

urllist = [url1, url2, url3, .. , url100000]
for _ in Pool(20).imap_unordered(process_url, urllist):
    pass
To run processes instead of threads, remove .dummy from the import.
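In case it helps, here is one hedged way the process_url placeholder could be filled in; the urllib fetch, the timeout, and the results.txt name are my assumptions, and returning the result keeps all file writes in the main thread:

#!/usr/bin/env python
from multiprocessing.dummy import Pool  # thread pool; drop .dummy for processes
import urllib.request

def process_url(url):
    # Fetch one URL and report its status; errors become part of the result
    try:
        status = urllib.request.urlopen(url, timeout=30).getcode()
    except Exception as exc:
        status = repr(exc)
    return url, status

def run(urllist):
    with open("results.txt", "w") as f:
        # imap_unordered yields each result as soon as any worker finishes
        for url, status in Pool(20).imap_unordered(process_url, urllist):
            f.write("%s %s\n" % (url, status))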