I'm gathering statistics on a list of websites and I'm using requests for it for simplicity. Here is my code:
import requests

data = []
websites = ['http://google.com', 'http://bbc.co.uk']
for w in websites:
    r = requests.get(w, verify=False)
    data.append((r.url, len(r.content), r.elapsed.total_seconds(),
                 str([(h.status_code, h.url) for h in r.history]),
                 str(r.headers.items()), str(r.cookies.items())))
Now, I want requests.get to time out after 10 seconds so the loop doesn't get stuck.
This question has been of interest before, but none of the answers are clean. I will be putting a bounty on this to get a nice answer.
I hear that maybe not using requests is a good idea, but then how would I get the nice things requests offers (the ones in the tuple)?
UPDATE: http://docs.python-requests.org/en/master/user/advanced/#timeouts
In the new version of requests: if you specify a single value for the timeout, it will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the two values separately. If the remote server is very slow, you can tell Requests to wait forever for a response by passing None as the timeout value and then retrieving a cup of coffee.
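A minimal sketch of the three forms described above; the example URL and the exact values are illustrative:

import requests

# A single value is applied to both the connect and the read timeout.
r = requests.get('https://github.com', timeout=5)

# A (connect, read) tuple sets the two timeouts separately.
r = requests.get('https://github.com', timeout=(3.05, 27))

# None means wait forever for a response.
r = requests.get('https://github.com', timeout=None)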
My old (probably outdated) answer, which was posted a long time ago:
There are other ways to overcome this problem:
1. Use the TimeoutSauce internal class (a sketch follows this list). From: https://github.com/kennethreitz/requests/issues/1928#issuecomment-35811896
2. Use a fork of requests from kevinburke: https://github.com/kevinburke/requests/tree/connect-timeout
From its documentation: https://github.com/kevinburke/requests/blob/connect-timeout/docs/user/advanced.rst
kevinburke has requested it to be merged into the main requests project, but it hasn't been accepted yet.
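For option 1 above, a sketch of the TimeoutSauce idea, assuming TimeoutSauce is still importable from requests.adapters; the class name and the default values are mine:

import requests
from requests.adapters import TimeoutSauce

class MyTimeout(TimeoutSauce):
    def __init__(self, *args, **kwargs):
        # Fall back to a 10-second connect/read timeout when the caller gives none.
        if kwargs.get('connect') is None:
            kwargs['connect'] = 10
        if kwargs.get('read') is None:
            kwargs['read'] = 10
        super(MyTimeout, self).__init__(*args, **kwargs)

# Monkey-patch the timeout class used by requests' HTTP adapter.
requests.adapters.TimeoutSauce = MyTimeout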
Despite the question being about requests, I find this very easy to do with pycurl CURLOPT_TIMEOUT or CURLOPT_TIMEOUT_MS.
No threading or signaling required:
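A rough sketch; the URL and the timeout values are illustrative:

import io
import pycurl

buf = io.BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://google.com')
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.CONNECTTIMEOUT, 5)   # CURLOPT_CONNECTTIMEOUT
c.setopt(pycurl.TIMEOUT, 10)         # CURLOPT_TIMEOUT: whole transfer, in seconds
c.setopt(pycurl.NOSIGNAL, 1)         # avoid libcurl using signals for timeouts
c.perform()
status = c.getinfo(pycurl.RESPONSE_CODE)
c.close()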
I believe you can use multiprocessing and not depend on a 3rd-party package, along these lines. The timeout passed in kwargs is the timeout to get any response from the server; the timeout argument is the timeout to get the complete response:
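A sketch of that approach; the helper name is mine, and it assumes fork-based multiprocessing (the nested worker function won't pickle under the Windows spawn start method):

import multiprocessing
import requests

def run_with_timeout(func, args, kwargs, timeout):
    # Run func(*args, **kwargs) in a child process and kill it after `timeout` seconds.
    manager = multiprocessing.Manager()
    result = manager.dict()

    def worker(result):
        result['value'] = func(*args, **kwargs)

    p = multiprocessing.Process(target=worker, args=(result,))
    p.start()
    p.join(timeout)            # wait at most `timeout` seconds for the whole call
    if p.is_alive():
        p.terminate()
        p.join()
        raise TimeoutError('no complete response within {} seconds'.format(timeout))
    return result['value']

# kwargs['timeout'] bounds the wait for any response from the server;
# the outer timeout bounds the time to get the complete response.
r = run_with_timeout(requests.get, args=['http://google.com'],
                     kwargs={'timeout': 10, 'verify': False}, timeout=60)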
Set stream=True and use r.iter_content(1024). Yes, eventlet.Timeout just somehow doesn't work for me. The discussion is here: https://redd.it/80kp1h
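A minimal sketch of that idea, assuming a hard 10-second budget for the whole body; the URL and values are illustrative:

import time
import requests

url = 'http://google.com'
max_seconds = 10

start = time.time()
r = requests.get(url, stream=True, timeout=5)   # this timeout only bounds individual socket operations
content = b''
for chunk in r.iter_content(1024):
    content += chunk
    if time.time() - start > max_seconds:
        r.close()
        raise TimeoutError('no complete response within {} seconds'.format(max_seconds))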
This may be overkill, but the Celery distributed task queue has good support for timeouts.
In particular, you can define a soft time limit that just raises an exception in your process (so you can clean up) and/or a hard time limit that terminates the task when the time limit has been exceeded.
Under the covers, this uses the same signals approach as referenced in your "before" post, but in a more usable and manageable way. And if the list of web sites you are monitoring is long, you might benefit from its primary feature -- all kinds of ways to manage the execution of a large number of tasks.
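A sketch of what that might look like; the broker URL, the limits, and the task body are assumptions, not part of the original answer:

import requests
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery('stats', broker='redis://localhost:6379/0')

@app.task(soft_time_limit=10, time_limit=20)
def fetch(url):
    try:
        r = requests.get(url, verify=False)
        return (r.url, len(r.content), r.elapsed.total_seconds())
    except SoftTimeLimitExceeded:
        # The soft limit raises here so the task can clean up;
        # the hard time_limit would terminate the worker child outright.
        return None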
In case you're using the option stream=True, you can do this; the solution does not need signals or multiprocessing:
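A sketch: stream the body and enforce an overall deadline while iterating; the helper name and the defaults are mine:

import time
import requests

def get_with_deadline(url, deadline=10, chunk_size=1024, **kwargs):
    # Stream the body and give up once `deadline` seconds have passed in total.
    start = time.time()
    r = requests.get(url, stream=True, timeout=deadline, **kwargs)
    body = b''
    for chunk in r.iter_content(chunk_size):
        body += chunk
        if time.time() - start > deadline:
            r.close()
            raise TimeoutError('no complete response within {} seconds'.format(deadline))
    return r, body

r, body = get_with_deadline('http://bbc.co.uk', deadline=10)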