I'm making over 100K calls to an API using two functions. With the first function I reach out to the API and grab the sysinfo (a dict) for each host; with the second I go through sysinfo and grab the IP addresses. I'm looking for a way to speed this up, but I have never used multiprocessing/threading before (it currently takes about 3 hours).
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
#pool = ThreadPool(4)
p = Pool(5)
#obviously I removed a lot of the code that generates some of these
#variables, but this is the part that slooooows everything down.
def get_sys_info(self, host_id, appliance):
    sysinfo = self.hx_request("https://{}:3000//hx/api/v3/hosts/{}/sysinfo"
    return sysinfo
def get_ips_from_sysinfo(self, sysinfo):
    sysinfo = sysinfo["data"]
    network_array = sysinfo.get("networkArray", {})
    network_info = network_array.get("networkInfo", [])
    ips = []
    for ni in network_info:
        ip_array = ni.get("ipArray", {})
        ip_info = ip_array.get("ipInfo", [])
        for i in ip_info:
            ips.append(i)
    return ips
if __name__ == "__main__":
    for i in ids:
        sysinfo = rr.get_sys_info(i, appliance)
        hostname = sysinfo.get("data", {}).get("hostname")
        try:
            ips = p.map(rr.get_ips_from_sysinfo(sysinfo))
        except Exception as e:
            rr.logger.error("Exception on {} -- {}".format(hostname, e))
            continue
        #Tried calling it here
        ips = p.map(rr.get_ips_from_sysinfo(sysinfo))
I have to go through over 100,000 of these API calls, and this is really the part that slows everything down.
I think I've tried everything and gotten every possible "iterable" and "missing argument" error.
I'd just really appreciate any type of help. Thank you!
For whatever reason I was a little leery about calling an instance method in numerous threads, but it seems to work. I made this toy example using concurrent.futures; hopefully it mimics your actual situation well enough. This submits 4000 instance method calls to a thread pool of (at max) 500 workers. Playing around with the max_workers value, I found that execution time improvements were pretty linear up to about 1000 workers, then the improvement ratio started tailing off. I didn't account for possible exceptions being thrown during the method call, but the example in the docs is pretty clear on how to handle that.
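A minimal sketch of what such a toy example might look like, not the original code from this answer. The Host class, its delay parameter, and the run_calls helper are hypothetical names, and time.sleep stands in for the real network request. Exceptions are caught per future, following the pattern in the concurrent.futures docs.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class Host:
    """Hypothetical stand-in for the object that owns the API call."""

    def __init__(self, delay=0.01):
        self.delay = delay

    def get_sys_info(self, host_id):
        # Simulate a blocking network call; the real method would do an HTTP request.
        time.sleep(self.delay)
        return {"data": {"hostname": "host-{}".format(host_id)}}


def run_calls(n_calls, max_workers):
    """Submit n_calls instance-method calls to a thread pool.

    Returns a (results, elapsed_seconds) tuple, where results maps
    host_id to the dict returned by get_sys_info.
    """
    host = Host()
    results = {}
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the host id it was submitted for.
        futures = {executor.submit(host.get_sys_info, i): i for i in range(n_calls)}
        for future in as_completed(futures):
            host_id = futures[future]
            try:
                results[host_id] = future.result()
            except Exception as exc:
                # One failed call should not abort the whole batch.
                print("host {} failed: {}".format(host_id, exc))
    return results, time.perf_counter() - start
```

Calling run_calls(4000, 500) would correspond to the numbers mentioned above; timing a few different max_workers values is one way to see where the improvement starts to flatten on a given machine.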
So... after days of looking at the suggestions on here (thank you so much!!!) and a couple of outside readings (Fluent Python Ch. 17 and Effective Python: 59 Specific Ways...),
*Modified: this works now, hope it helps someone else.
You can use threads and a queue to communicate. First, start get_ips_from_sysinfo as a single thread to monitor and process any finished sysinfo, storing its output in output_list. Then fire all the get_sys_info threads. Be careful not to run out of memory with 100k threads.
As @wwii commented, concurrent.futures offers some conveniences that may help you, particularly since this looks like a batch job. It appears that your performance hit is most likely to come from the network calls, so multithreading is probably more suitable for your use case (here is a comparison with multiprocessing). If not, you can switch the pool from threads to processes while using the same APIs.
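Here is a rough sketch of how that could look with the methods from the question, not a drop-in replacement for your code. The Client class stands in for your rr object, get_sys_info returns a canned response instead of calling hx_request, and the "ipAddress" key, the get_ips helper and the max_workers default are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Client:
    """Hypothetical stand-in for the asker's `rr` object."""

    def __init__(self, appliance):
        self.appliance = appliance

    def get_sys_info(self, host_id):
        # Placeholder for the real HTTP request; returns a canned response.
        # The shape of the "ipInfo" entries is assumed.
        return {"data": {
            "hostname": "host-{}".format(host_id),
            "networkArray": {"networkInfo": [
                {"ipArray": {"ipInfo": [{"ipAddress": "10.0.0.{}".format(host_id)}]}}
            ]},
        }}

    def get_ips_from_sysinfo(self, sysinfo):
        data = sysinfo.get("data", {})
        network_info = data.get("networkArray", {}).get("networkInfo", [])
        return [ip for ni in network_info
                for ip in ni.get("ipArray", {}).get("ipInfo", [])]

    def get_ips(self, host_id):
        # One unit of work per host: fetch the sysinfo, then extract the IPs.
        return host_id, self.get_ips_from_sysinfo(self.get_sys_info(host_id))


def collect_ips(client, ids, max_workers=20, executor_cls=ThreadPoolExecutor):
    # map() takes the callable and the iterable separately; passing the
    # result of calling the function (as in the question) raises the
    # "missing argument" errors.
    with executor_cls(max_workers=max_workers) as executor:
        return dict(executor.map(client.get_ips, ids))
```

Passing executor_cls=ProcessPoolExecutor is the "same API" switch mentioned above; in that case the client object has to be picklable, because the bound method is sent to the worker processes.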
You can streamline this example by refactoring your methods into functions, if indeed they don't make use of state as seems to be the case in your code.
If extracting the sysinfo data is expensive, you can enqueue the results and in turn feed those to a ProcessPoolExecutor that calls get_ips_from_sysinfo on the queued dicts.
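One possible shape for that two-stage setup is sketched below, assuming the methods have been refactored into module-level functions. Instead of an explicit queue it hands each finished sysinfo dict to the second pool as soon as the fetch completes, which has a similar effect. The canned get_sys_info response, the string-valued "ipInfo" entries, the pipeline name and the worker counts are all assumptions.

```python
from concurrent.futures import (
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)


def get_sys_info(host_id):
    # Placeholder for the real HTTP request (the I/O-bound stage).
    return {"data": {"networkArray": {"networkInfo": [
        {"ipArray": {"ipInfo": ["10.0.0.{}".format(host_id)]}}
    ]}}}


def get_ips_from_sysinfo(sysinfo):
    # Module-level function so a process pool can pickle a reference to it.
    data = sysinfo.get("data", {})
    network_info = data.get("networkArray", {}).get("networkInfo", [])
    return [ip for ni in network_info
            for ip in ni.get("ipArray", {}).get("ipInfo", [])]


def pipeline(ids, io_workers=20, cpu_workers=4,
             cpu_executor_cls=ProcessPoolExecutor):
    """Fetch sysinfo in threads, extract IPs in a second pool."""
    with ThreadPoolExecutor(max_workers=io_workers) as io_pool, \
            cpu_executor_cls(max_workers=cpu_workers) as cpu_pool:
        fetches = {io_pool.submit(get_sys_info, i): i for i in ids}
        parses = {}
        for fetch in as_completed(fetches):
            # Hand each sysinfo dict to the second pool as soon as it arrives.
            parse = cpu_pool.submit(get_ips_from_sysinfo, fetch.result())
            parses[parse] = fetches[fetch]
        # Collect the extracted IPs, keyed by host id.
        return {parses[p]: p.result() for p in as_completed(parses)}
```

When using the default ProcessPoolExecutor, call pipeline(ids) from under an if __name__ == "__main__": guard in a script. For a parse step as cheap as the one in the question, the cost of pickling each dict across processes may outweigh any gain, so it is worth measuring before adopting this.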