I'm writing a multi-threaded script to retrieve content from a website. The site isn't very stable, so every now and then there's a hanging HTTP request that cannot even be timed out by socket.setdefaulttimeout(). Since I have no control over that website, the only thing I can do is improve my code, but I'm running out of ideas right now.
Sample code:
import socket
import urllib2
import mechanize

socket.setdefaulttimeout(150)

MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)'}
Url = "http://example.com"
Data = "Justatest=whatever&letstry=doit"
Request = urllib2.Request(Url, Data, Header)
Response = MechBrowser.open(Request)
Response.close()
What should I do to force the hanging requests to quit? Actually, I'd also like to know why socket.setdefaulttimeout(150) is not working in the first place. Can anybody help me out?
Added (and yes, the problem is still not solved):

OK, I've followed tomasz's suggestion and changed the code to MechBrowser.open(Request, timeout=60), but the same thing happens. I still get hanging requests at random; sometimes the script runs for several hours before one appears, other times for several days. What do I do now? Is there a way to force these hanging requests to quit?
You could try using mechanize with eventlet. It does not solve your timeout problem, but greenlets are non-blocking, so it may solve your performance problem.
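A rough sketch of that setup (the URLs and pool size here are made up; eventlet.monkey_patch() swaps the blocking stdlib sockets for cooperative ones, so one slow request no longer stalls the others):

import eventlet
eventlet.monkey_patch()  # patch socket & friends so blocking I/O yields to other greenlets
import mechanize

def fetch(url):
    # Each fetch runs in its own greenlet.
    browser = mechanize.Browser()
    response = browser.open(url, timeout=60)
    return response.read()

urls = ["http://example.com/a", "http://example.com/b"]  # hypothetical
pool = eventlet.GreenPool(10)
for body in pool.imap(fetch, urls):
    print len(body)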
While socket.setdefaulttimeout() will set the default timeout for new sockets, the setting is easily overridden if you're not using the sockets directly. In particular, if the library calls socket.setblocking() on its socket, that resets the timeout.

urllib2.urlopen() has a timeout argument; however, there is no timeout parameter on urllib2.Request. As you're using mechanize, you should refer to their documentation: http://wwwsearch.sourceforge.net/mechanize/documentation.html
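Concretely, that means passing the timeout on each open() call (the form you later tried in your update) rather than relying on the global socket default; a minimal sketch:

import mechanize

br = mechanize.Browser()
response = br.open("http://example.com", timeout=60)  # per-request timeout, in seconds
content = response.read()
response.close()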
---EDIT---
If either socket.setdefaulttimeout or passing a timeout to mechanize works with small values but not with higher ones, the source of the problem might be completely different. One possibility is that your library opens multiple connections (credit to @Cédric Julien here), so the timeout applies to every single socket attempt, and if the request doesn't stop at the first failure it can take up to timeout * num_of_conn seconds. The other thing is socket.recv: if the connection is really slow and you're unlucky enough, the whole request can take up to timeout * incoming_bytes seconds, since every socket.recv call could return a single byte and each such call could take timeout seconds. As you're unlikely to suffer from exactly this dark scenario (one byte per timeout seconds? you would have to be spectacularly unlucky), it's still very likely for a request to take ages on a very slow connection with a very high timeout.

The only solution is to force a timeout on the whole request, and there's nothing you can do about that at the socket level. If you're on Unix, you can use a simple solution based on the ALARM signal: arrange for the signal to be raised in timeout seconds, and the request will be interrupted (don't forget to catch the resulting exception). You might like to wrap this in a with statement to make it clean and easy to use.
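For example, a minimal sketch (Unix-only; request() here is just a stand-in for your real MechBrowser.open call, and note that SIGALRM handlers can only be installed from the main thread):

import signal
import time

class Timeout(object):
    """Context manager that raises Timeout.Timeout after sec seconds (Unix only)."""
    class Timeout(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)  # whole seconds

    def __exit__(self, *args):
        signal.alarm(0)  # disable the alarm on the way out

    def raise_timeout(self, *args):
        raise Timeout.Timeout()

def request(arg):
    """Stand-in for your real HTTP request."""
    time.sleep(2)
    return arg

try:
    with Timeout(3):
        print request("Request 1")  # finishes in time
    with Timeout(1):
        print request("Request 2")  # interrupted by SIGALRM
except Timeout.Timeout:
    print "Timeout"

# Prints "Request 1" and then "Timeout".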
If you want to be more portable than that, you have to use bigger guns, for example multiprocessing: spawn a separate process to perform the request and terminate it if it's overdue. Since it's a separate process, you need something to transfer the result back to your application; multiprocessing.Pipe works well.
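Here's a sketch of that approach (again, request() is a stand-in for the real call; the child sends its result back over the pipe):

import multiprocessing
import time

def request(sleep_for, conn):
    # Stand-in for the real HTTP request; it reports back over the pipe.
    time.sleep(sleep_for)
    conn.send("done after %d s" % sleep_for)
    conn.close()

def run_with_timeout(timeout, default, func, *args):
    # Run func(*args, send_end) in a child process; kill it if it overruns.
    recv_end, send_end = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=func, args=args + (send_end,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # force the overdue request to quit
        proc.join()
        return default
    return recv_end.recv()

if __name__ == '__main__':
    print run_with_timeout(3, "timed out", request, 1)  # done after 1 s
    print run_with_timeout(1, "timed out", request, 2)  # timed out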
You really don't have much choice if you want to force the request to terminate after a fixed number of seconds. socket.timeout provides a timeout for a single socket operation (connect/recv/send), but if you have many of them back to back, you can still suffer from very long execution times.
Their documentation notes that, unlike urllib2.Request, mechanize.Request takes a timeout constructor argument. Perhaps you should try replacing urllib2.Request with mechanize.Request.
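If mechanize.Request mirrors the (url, data, headers) arguments of urllib2.Request plus a timeout keyword, the change to your code could look like this sketch (check your mechanize version's exact signature):

import mechanize

Url = "http://example.com"
Data = "Justatest=whatever&letstry=doit"
Header = {'User-Agent': 'Mozilla/5.0 ...'}  # same User-Agent string as above

# mechanize.Request instead of urllib2.Request, with a per-request timeout
Request = mechanize.Request(Url, Data, Header, timeout=60)
Response = mechanize.Browser().open(Request)
Response.close()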
I suggest a simple workaround: move the request to a different process, and if it fails to terminate, kill it from the calling process, this way:
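A minimal sketch of that idea (the body of do_request() is whatever currently hangs, e.g. your MechBrowser.open call; the 120-second limit is an arbitrary choice):

import multiprocessing

def do_request():
    # The request that may hang goes here, e.g. MechBrowser.open(Request).
    pass

proc = multiprocessing.Process(target=do_request)
proc.start()
proc.join(120)        # give the request at most 120 seconds
if proc.is_alive():
    proc.terminate()  # kill the hung request from the calling process
    proc.join()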
Simple, fast, and effective.