When scraping multiple websites in a loop, I notice there is a rather large difference in speed between,
sleep(10)
response = requests.get(url)
and,
response = requests.get(url, timeout=10)
That is, timeout is much faster.
Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.
- Why is there such a difference in speed?
- Why is the scraping duration per page less than 10 seconds?
I now use multiprocessing, but I seem to remember that the above holds without multiprocessing as well.
time.sleep stops your script from running for a certain number of seconds, while timeout is the maximum time to wait for the URL to be retrieved. More precisely, requests raises an exception if the server sends no data for that many seconds; it is not a cap on the whole download. If the data is retrieved before the timeout is up, the remaining time is skipped. So a request with timeout can take less than 10 seconds.
time.sleep is different: it pauses your script completely until it is done sleeping, and only then runs your request, which takes another few seconds. So the time.sleep version will take more than 10 seconds every time.
They have very different uses, but in your case you should use a timer: if the request finishes in under 10 seconds, make the program wait for the remaining time.
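The timer idea above could be sketched roughly as follows. The function name fetch_with_min_interval and its parameters are made up for illustration, and fetch is assumed to be any callable that takes a URL and returns a response:

```python
import time


def fetch_with_min_interval(urls, fetch, min_interval=10.0):
    """Call fetch(url) for each URL, spending at least min_interval seconds per page.

    `fetch` is a hypothetical callable supplied by the caller; this helper
    only handles the pacing.
    """
    responses = []
    for url in urls:
        started = time.monotonic()
        responses.append(fetch(url))
        elapsed = time.monotonic() - started
        # Sleep only for whatever is left of the interval, if anything.
        remaining = min_interval - elapsed
        if remaining > 0:
            time.sleep(remaining)
    return responses
```

With requests this might be called as fetch_with_min_interval(urls, lambda u: requests.get(u, timeout=10)). A fast page then takes about 10 seconds in total rather than 10 seconds plus the request time, and a page that takes longer than the interval is not delayed any further.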
response = requests.get(url, timeout=10)
# timeout specifies the maximum time the program will wait for the request to complete before raising an exception. The program does not necessarily pause for 10 seconds: if the response arrives early, it stops waiting.
Read more about timeout in the requests documentation.
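Since the timeout only matters when the server is slow, a scraping loop usually wants to catch the resulting exception instead of crashing. A minimal sketch, where get_or_none is a hypothetical helper name and returning None on a timeout is just one possible choice:

```python
import requests


def get_or_none(url, timeout=10):
    """Return the response, or None if the server did not answer in time."""
    try:
        # Returns as soon as the response arrives; timeout is only an upper bound.
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Base class of both ConnectTimeout and ReadTimeout.
        return None
```

Other failures, such as connection errors, are deliberately left uncaught here so they remain visible.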
time.sleep causes your main thread to sleep, so your program will always wait 10 seconds before making a request to the URL.