Python web scraping: difference between sleep and

Posted 2020-08-01 07:18

Question:

When scraping multiple websites in a loop, I notice a rather large difference in speed between

sleep(10)
response = requests.get(url)

and

response = requests.get(url, timeout=10)

That is, the timeout version is much faster.

Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.

  1. Why is there such a difference in speed?
  2. Why is the scraping duration per page less than 10 seconds?

I now use multiprocessing, but I seem to remember that the above also holds without multiprocessing.

Answer 1:

time.sleep stops your script for a fixed number of seconds, while timeout is the maximum time requests will wait for the server to respond. If the response arrives before the timeout is up, the remaining time is simply skipped, so a request with timeout=10 can easily take less than 10 seconds.

time.sleep is different: it pauses your script completely until the sleep is over, and only then does the request run, which adds its own time on top. So the time.sleep version takes more than 10 seconds every time.
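The difference can be seen by timing both variants. The following is only a sketch that runs offline: fake_get is a hypothetical stand-in for requests.get (it "responds" after 0.05 s), the URL is a placeholder that is never contacted, and the 10-second values from the question are scaled down to 1 second so the demo finishes quickly.

```python
import time

def fake_get(url, timeout=None):
    # Hypothetical stand-in for requests.get: the "server" answers after
    # 0.05 s, far sooner than the timeout used below.
    time.sleep(0.05)
    return "response for " + url

def timed(action):
    # Return how many seconds the callable took to run.
    start = time.monotonic()
    action()
    return time.monotonic() - start

def sleep_then_get(url, pause):
    time.sleep(pause)      # always waits the full pause first
    return fake_get(url)   # then the request time is added on top

url = "http://example.com/page"  # placeholder URL, never contacted

# The timeout is only an upper bound, so this returns as soon as the
# stand-in responds.
with_timeout = timed(lambda: fake_get(url, timeout=1))

# The sleep is unconditional, so this takes the pause plus the request.
with_sleep = timed(lambda: sleep_then_get(url, pause=1))

print(f"timeout=1: {with_timeout:.2f} s")
print(f"sleep(1):  {with_sleep:.2f} s")
```

The first figure should be close to the response time of the stand-in, and the second a little over one second.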

They have very different uses. For your case, time each request yourself and, if it finishes in less than 10 seconds, make the program wait for the remainder.
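A minimal sketch of such a timer follows. The function name fetch_with_min_interval and its parameters are made up for illustration. The fetch argument is any callable that takes a URL, for example lambda u: requests.get(u, timeout=10). The sleep and clock parameters exist only so the pacing can be tried out without real waiting.

```python
import time

def fetch_with_min_interval(urls, fetch, min_interval=10.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, keeping the starts of consecutive
    requests at least min_interval seconds apart."""
    results = []
    for i, url in enumerate(urls):
        start = clock()
        results.append(fetch(url))
        elapsed = clock() - start

        # Wait only for whatever is left of the interval. Nothing is
        # added if the request itself took min_interval or longer, and
        # there is no need to wait after the last URL.
        remaining = min_interval - elapsed
        if remaining > 0 and i < len(urls) - 1:
            sleep(remaining)
    return results

# Possible usage with requests (assumes requests is installed):
#   import requests
#   pages = fetch_with_min_interval(
#       urls, lambda u: requests.get(u, timeout=10))
```

Unlike a plain sleep(10) before each request, this counts the time the request already took, so each page costs about 10 seconds in total rather than 10 seconds plus the request time.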



Answer 2:

response = requests.get(url, timeout=10)
# timeout specifies the maximum time the program will wait for the server to respond before raising an exception. The program does not necessarily pause for 10 seconds: if the response arrives early, it stops waiting.

Read more about timeouts in the requests documentation.

time.sleep makes your main thread sleep, so your program always waits the full 10 seconds before making the request to the URL.