Problem: Check a list of over 1000 URLs and get each URL's return code (status_code).
The script I have works, but it is very slow.
I am thinking there has to be a better, more Pythonic (more beautiful) way of doing this, where I can spawn 10 or 20 threads to check the URLs and collect the responses, e.g.:
200 -> www.yahoo.com
404 -> www.badurl.com
...
Input file: url10.txt
www.example.com
www.yahoo.com
www.testsite.com
....
import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

print(urls)

for url in urls:
    url = 'http://' + url  # Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)
Challenge: Improve speed with multiprocessing.
With multiprocessing
But it is not working. I get the following error (note: I am not sure if I have even implemented this correctly):
AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>
--
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
In the checkurlconnection function, the body should work on the url parameter instead of looping over urls: each call gets a single URL from Pool.map, but inside the for loop urls points to the global variable, so every worker loops over the whole list, which is not what you want.
In this case your task is I/O bound and not processor bound: it takes longer for a website to reply than it does for your CPU to loop once through your script (not counting the TCP request). What this means is that you won't get any speedup from doing this task in parallel processes (which is what multiprocessing does). What you want is multi-threading. The way this is achieved is by using the little-documented, perhaps poorly named, multiprocessing.dummy. See here for examples of multiprocessing vs multithreading in Python.
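As a rough sketch only (not code from the original answer), here is one way the threaded version could look, assuming each worker handles a single URL per call and the input file is url10.txt as above; multiprocessing.dummy.Pool exposes the same map interface as multiprocessing.Pool but uses threads:

import requests
from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool

def checkurlconnection(url):
    # Each call handles exactly one URL handed to it by Pool.map
    url = 'http://' + url
    try:
        resp = requests.get(url, timeout=1)
        return resp.status_code, resp.url
    except requests.RequestException:
        return "Error", url

if __name__ == "__main__":
    with open("url10.txt") as f:
        urls = f.read().splitlines()
    pool = Pool(20)  # 20 worker threads; tune to the number of URLs and your bandwidth
    results = pool.map(checkurlconnection, urls)
    pool.close()
    pool.join()
    for status, final_url in results:
        print(status, '->', final_url)

Returning the result instead of printing inside the worker is only a convenience so that Pool.map collects everything in one place; printing directly, as in the original script, works as well. Because the workers here are threads in the same process, the function does not need to be pickled and re-imported, which is what typically triggers errors like the AttributeError shown above in the process-based version.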