Does urllib2 fetch the whole page when a urlopen call is made?
I'd like to read just the HTTP response headers without getting the page. It looks like urllib2 opens the HTTP connection and then subsequently gets the actual HTML page... or does it just start buffering the page with the urlopen call?
import urllib2

myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
page = urllib2.urlopen(myurl)  # open connection, get headers
html = page.readlines()        # stream page
Use the response.info() method to get the headers. According to the urllib2 docs, info() returns the meta-information of the page, such as headers.
So, for your example, try stepping through the result of response.info().headers for what you're looking for. Note that the major caveat to using httplib.HTTPMessage is documented in Python issue 4773.
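To make the info() suggestion concrete, here is a minimal sketch that reads the headers and closes the response without ever calling read() or readlines(). The helper name fetch_headers is made up for illustration, and the try/except import is only there so the same code also runs on Python 3, where this API lives in urllib.request. Note that this is still a GET, so the server may already have started sending the body.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same API


def fetch_headers(url):
    """Return the response headers as a dict with lowercased names.

    Hypothetical helper: repeated headers (e.g. Set-Cookie) collapse
    to a single entry in the returned dict.
    """
    response = urllib2.urlopen(url)
    try:
        # info() exposes the parsed headers; the body is never read here.
        return {name.lower(): value for name, value in response.info().items()}
    finally:
        response.close()
```

Calling fetch_headers(myurl) with the URL from the question should give you something you can index by header name, such as 'content-type'.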
One-liner:
What about sending a HEAD request instead of a normal GET request? The following snippet (copied from a similar question) does exactly that.
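A commonly used way to do this with urllib2 is to subclass Request and override get_method. The sketch below is a reconstruction of that idea, not necessarily the exact snippet from the other question; the names HeadRequest and head are illustrative.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same API


class HeadRequest(urllib2.Request):
    """A Request that is sent with the HEAD method instead of GET."""

    def get_method(self):
        return "HEAD"


def head(url):
    """Send a HEAD request and return (status code, headers)."""
    response = urllib2.urlopen(HeadRequest(url))
    try:
        return response.getcode(), response.info()
    finally:
        response.close()
```

With this in place, head(myurl) returns the status code and the header object without a response body being transferred for the initial request.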
urllib2.urlopen does an HTTP GET (or POST if you supply a data argument), not an HTTP HEAD (if it did the latter, you couldn't do readlines or other accesses to the page body, of course).
Actually, it appears that urllib2 can do an HTTP HEAD request.
The question that @reto linked to, above, shows how to get urllib2 to do a HEAD request.
Here's my take on it:
If you check this with something like the Wireshark network protocol analyzer, you can see that it is actually sending out a HEAD request, rather than a GET.
This is the HTTP request and response from the code above, as captured by Wireshark:
However, as mentioned in one of the comments in the other question, if the URL in question includes a redirect then urllib2 will do a GET request to the destination, not a HEAD. This could be a major shortcoming, if you really wanted to only make HEAD requests.
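One possible workaround for the redirect behaviour is a redirect handler that rebuilds the follow-up request as a HEAD. This is only a sketch under assumptions: HeadRequest and HeadRedirectHandler are illustrative names, and it relies on the stock HTTPRedirectHandler.redirect_request hook to validate the redirect first.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same API


class HeadRequest(urllib2.Request):
    """A Request that is sent with the HEAD method instead of GET."""

    def get_method(self):
        return "HEAD"


class HeadRedirectHandler(urllib2.HTTPRedirectHandler):
    """Follow redirects of HEAD requests with another HEAD request."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Let the stock handler validate the redirect and build the new request.
        new_req = urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)
        if new_req is None or req.get_method() != "HEAD":
            return new_req
        # Rebuild the follow-up request so that it stays a HEAD.
        return HeadRequest(new_req.get_full_url(),
                           headers=new_req.headers,
                           origin_req_host=new_req.origin_req_host,
                           unverifiable=True)


# Passing the subclass replaces the default redirect handler in the opener.
opener = urllib2.build_opener(HeadRedirectHandler)
```

You would then call opener.open(HeadRequest(myurl)) instead of urllib2.urlopen, and it is worth confirming in Wireshark that both requests go out as HEAD.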
The request above involves a redirect. Here is the request to the destination, as captured by Wireshark:
An alternative to using urllib2 is to use Joe Gregorio's httplib2 library:
This has the advantage of using HEAD requests for both the initial HTTP request and the redirected request to the destination URL.
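A minimal sketch of the httplib2 approach could look like the following. httplib2 is a third-party package that has to be installed separately, and the helper name head_with_httplib2 is made up here. Whether redirects are followed with HEAD may depend on the httplib2 version and the redirect status code, so treat that as something to verify rather than a guarantee.

```python
def head_with_httplib2(url):
    """Send a HEAD request via httplib2 and return (status, headers)."""
    import httplib2  # third-party; imported here so the module loads without it

    http = httplib2.Http()
    # request() returns a (response, content) pair; content is empty for HEAD.
    response, content = http.request(url, "HEAD")
    return response.status, dict(response)
```

The response object behaves like a dict of header names to values, so dict(response) gives a plain copy you can inspect or print.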
Here's the first request:
Here's the second request, to the destination: