I'm looking for a way in Python (2.7) to do HTTP requests with 3 requirements:
- timeout (for reliability)
- content maximum size (for security)
- connection pooling (for performance)
I've checked just about every Python HTTP library, but none of them meet my requirements. For instance:
urllib2: good, but no pooling
import urllib2
import json

# small limit: 100 bytes
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100+1)
if len(content) > 100:
    print 'too large'
    r.close()
else:
    print json.loads(content)

# larger limit: 100 kB
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100000+1)
if len(content) > 100000:
    print 'too large'
    r.close()
else:
    print json.loads(content)
requests: no max size
import requests
import json

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.headers['content-length'] # does not exist for this request, and not safe anyway
content = r.raw.read(100000+1)
print content # ARF, this is gzipped, so not the real size
print json.loads(content) # content is gzipped, so pretty useless
print r.json() # does not work anymore since raw.read was used
urllib3: never got the "read" method working, even with a 50 MB file ...
httplib: httplib.HTTPConnection is not a pool (only one connection)
I can hardly believe that urllib2 is the best HTTP library I can use! So if anyone knows of a library that can do this, or how to use one of the libraries above ...
EDIT:
The best solution I found, thanks to Martijn Pieters (StringIO does not slow down even for huge files, whereas str concatenation slows things down a lot):
import requests
from StringIO import StringIO

maxsize = 100000  # maximum number of bytes we are willing to accept

r = requests.get('https://github.com/timeline.json', stream=True)
size = 0
ctt = StringIO()

for chunk in r.iter_content(2048):
    size += len(chunk)
    ctt.write(chunk)
    if size > maxsize:
        r.close()
        raise ValueError('Response too large')

content = ctt.getvalue()
You can do it with requests just fine; but you need to know that the raw object is part of the urllib3 guts and make use of the extra arguments the HTTPResponse.read() call supports, which lets you specify you want to read decoded data:
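For example, something along these lines should work (only a sketch, reusing the question's URL and a 100 kB limit; decode_content=True is the urllib3 keyword that asks for gzip/deflate decompression while reading):

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

# the amount applies to the raw (possibly compressed) stream; what comes
# back has already been decoded, and that is what we test against the limit
content = r.raw.read(100000 + 1, decode_content=True)
if len(content) > 100000:
    raise ValueError('Response too large')

print json.loads(content)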
Alternatively, you can set the decode_content flag on the raw object before reading:
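Again only a sketch, along the same lines as above, toggling the flag instead of passing the keyword:

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

r.raw.decode_content = True  # decompress gzip/deflate transparently while reading
content = r.raw.read(100000 + 1)
if len(content) > 100000:
    raise ValueError('Response too large')

print json.loads(content)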
If you don't like reaching into urllib3 guts like that, use response.iter_content() to iterate over the decoded content in chunks; this uses the underlying HTTPResponse too (using the .stream() generator version):
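A sketch of that variant, assuming the same limit and accumulating the decoded chunks into a string:

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

maxsize = 100000  # maximum number of decoded bytes we are willing to accept
content = ''
for chunk in r.iter_content(2048):  # chunks arrive already decompressed
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')

print json.loads(content)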
There is a subtle difference here in how compressed data sizes are handled; r.raw.read(100000+1) will only ever read 100k bytes of compressed data, and the uncompressed result is tested against your max size. The iter_content() method will read more uncompressed data in the rare case the compressed stream is larger than the uncompressed data.
Neither method allows r.json() to work; the response._content attribute isn't set by these; you can do so manually of course. But since the .raw.read() and .iter_content() calls already give you access to the content in question, there is really no need.
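If you really want r.json() back, something like this should do it, though note that _content is an internal requests attribute, so this relies on implementation details:

r._content = content  # hand the body read via raw.read() / iter_content() back to the Response
print r.json()        # works again now that the content is cached on the response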