Question:
What I'm trying to do here is get the headers of a given URL so I can determine the MIME type. I want to be able to see whether http://somedomain/foo/ will return an HTML document or a JPEG image, for example. Thus, I need to figure out how to send a HEAD request so that I can read the MIME type without having to download the content. Does anyone know of an easy way of doing this?
Answer 1:
edit: This answer works, but nowadays you should just use the requests library as mentioned by other answers below.
Use httplib.
>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> print res.getheaders()
[('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]
There's also a getheader(name) method to get a specific header.
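To get from a header list like the one printed above to a bare MIME type, the parameters after the semicolon (such as charset) have to be dropped. Here is a minimal sketch in Python 3 using only the standard library. The helper name mime_type_from_headers is made up for illustration, and it assumes the headers arrive as (name, value) pairs, as in the getheaders() output:

```python
from email.message import Message


def mime_type_from_headers(headers):
    """Return the bare MIME type from (name, value) header pairs, or None."""
    msg = Message()
    for name, value in headers:
        msg[name] = value
    # Header lookup is case-insensitive; a missing header yields None.
    if msg["Content-Type"] is None:
        return None
    # get_content_type() lowercases the type and strips "; charset=..." etc.
    return msg.get_content_type()


headers = [
    ("content-length", "0"),
    ("content-type", "text/html; charset=ISO-8859-1"),
]
print(mime_type_from_headers(headers))  # text/html
```

Returning None for a missing header is a choice made for this sketch; get_content_type() on its own would fall back to text/plain, which could hide the fact that the server sent no Content-Type at all.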
Answer 2:
urllib2 can be used to perform a HEAD request. This is a little nicer than using httplib since urllib2 parses the URL for you instead of requiring you to split the URL into host name and path.
>>> import urllib2
>>> class HeadRequest(urllib2.Request):
...     def get_method(self):
...         return "HEAD"
...
>>> response = urllib2.urlopen(HeadRequest("http://google.com/index.html"))
Headers are available via response.info() as before. Interestingly, you can find the URL that you were redirected to:
>>> print response.geturl()
http://www.google.com.au/index.html
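On Python 3, urllib2 became urllib.request, and Request accepts a method argument (available since Python 3.3), so the subclass is not needed there. A rough sketch of the same approach follows. The function names make_head_request and head_content_type are illustrative, not part of any library:

```python
import urllib.request


def make_head_request(url):
    """Build a Request whose HTTP method is HEAD instead of the default GET."""
    return urllib.request.Request(url, method="HEAD")


def head_content_type(url, timeout=10):
    """Send a HEAD request and return (final URL, bare MIME type)."""
    with urllib.request.urlopen(make_head_request(url), timeout=timeout) as response:
        # response.headers is a dict-like message object. get_content_type()
        # drops parameters such as "; charset=..." and falls back to
        # "text/plain" when the server sends no Content-Type header.
        return response.url, response.headers.get_content_type()
```

Calling head_content_type("http://google.com/index.html") needs network access, so no output is shown here. As in the urllib2 version, the URL is parsed for you.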
Answer 3:
The obligatory Requests way:
import requests
resp = requests.head("http://www.google.com")
print resp.status_code, resp.text, resp.headers
Answer 4:
I believe the Requests library should be mentioned as well.
Answer 5:
Just:
import urllib2
request = urllib2.Request('http://localhost:8080')
request.get_method = lambda : 'HEAD'
response = urllib2.urlopen(request)
response.info().gettype()
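Edit: I've just come to realize there is httplib2 :D
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
# resp is a (response, content) tuple; resp[0]['status'] is a string,
# so compare the integer attribute instead.
assert resp[0].status == 200
# The header usually carries a charset parameter, so match the prefix only.
assert resp[0]['content-type'].startswith('text/html')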
...
Answer 6:
For completeness, here is a Python 3 equivalent of the accepted answer, which uses httplib.
It is basically the same code, except that the library is no longer called httplib but http.client.
from http.client import HTTPConnection
conn = HTTPConnection('www.google.com')
conn.request('HEAD', '/index.html')
res = conn.getresponse()
print(res.status, res.reason)
Answer 7:
import httplib
import urlparse

def unshorten_url(url):
    # Send a HEAD request and return the redirect target, if there is one.
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    # Floor division keeps the 3xx check correct under true division as well.
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
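The function above targets Python 2 and plain HTTP. A possible Python 3 port is sketched below. The helper request_target is hypothetical, and choosing HTTPSConnection for https URLs and keeping the query string are additions that the original answer does not make:

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlsplit


def request_target(url):
    """Split a URL into (connection class, host, path plus query)."""
    parts = urlsplit(url)
    conn_cls = HTTPSConnection if parts.scheme == "https" else HTTPConnection
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return conn_cls, parts.netloc, target


def unshorten_url(url, timeout=10):
    """Return the Location of a 3xx response to a HEAD request, else url."""
    conn_cls, host, target = request_target(url)
    conn = conn_cls(host, timeout=timeout)
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status // 100 == 3 and location:
            return location
        return url
    finally:
        # Always release the socket, even if the request fails.
        conn.close()
```

Like the original, this follows only one redirect hop; a shortener that chains several redirects would need the function to be called repeatedly.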
Answer 8:
As an aside, when using httplib (at least on 2.5.2), trying to read the response of a HEAD request will block (on readline) and subsequently fail. If you do not call read on the response, you cannot send another request on the same connection; you will need to open a new one or accept a long delay between requests.
Answer 9:
I have found that httplib is faster than urllib2. I timed two programs, one using httplib and the other using urllib2, each sending HEAD requests to 10,000 URLs. The httplib one was faster by about 2 minutes 40 seconds. The total stats were:
httplib: real 6m21.334s, user 0m2.124s, sys 0m16.372s
urllib2: real 9m1.380s, user 0m16.666s, sys 0m28.565s
Does anybody else have input on this?
Answer 10:
And yet another approach (similar to Pawel's answer):
import urllib2
import types
request = urllib2.Request('http://localhost:8080')
request.get_method = types.MethodType(lambda self: 'HEAD', request, request.__class__)
This just avoids having an unbound function as a method at the instance level.
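On Python 3, types.MethodType takes only the function and the instance (the third, class argument no longer exists), and urllib2 is urllib.request. A sketch of the same binding under those assumptions is shown below; no request is sent, so nothing needs to be listening on the port:

```python
import types
import urllib.request

request = urllib.request.Request("http://localhost:8080")
# Bind the lambda to this one instance so it behaves like a regular method.
request.get_method = types.MethodType(lambda self: "HEAD", request)

print(request.get_method())  # HEAD
```

On Python 3.3 and later, passing method="HEAD" to the Request constructor achieves the same result without patching the instance.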
Answer 11:
Probably easier: use urllib or urllib2.
>>> import urllib
>>> f = urllib.urlopen('http://google.com')
>>> f.info().gettype()
'text/html'
f.info() is a dictionary-like object, so you can do f.info()['content-type'], etc.
http://docs.python.org/library/urllib.html
http://docs.python.org/library/urllib2.html
http://docs.python.org/library/httplib.html
The docs note that httplib is not normally used directly.