python-requests: fetching the head of the response

2019-03-17 03:14发布

问题:

Using python-requests and python-magic, I would like to test the mime-type of a web resource without fetching all its content (especially if this resource happens to be eg. an ogg file or a PDF file). Based on the result, I might decide to fetch it all. However calling the text method after having tested the mime-type only returns what hasn't been consumed yet. How could I test the mime-type without consuming the response content?

Below is my current code.

import requests
import magic


r = requests.get("http://www.december.com/html/demo/hello.html", prefetch=False)
mime = magic.from_buffer(r.iter_content(256).next(), mime=True)

if mime == "text/html":
    print(r.text)  # I'd like r.text to give me the entire response content

Thanks!

回答1:

Note: at the time this question was asked, the correct method to fetch only headers stream the body was to use prefetch=False. That option has since been renamed to stream and the boolean value is inverted, so you want stream=True.

The original answer follows.


Once you use iter_content(), you have to continue using it; .text indirectly uses the same interface under the hood (via .content).

In other words, by using iter_content() at all, you have to do the work .text does by hand:

from requests.compat import chardet

r = requests.get("http://www.december.com/html/demo/hello.html", prefetch=False)
peek = r.iter_content(256).next()
mime = magic.from_buffer(peek, mime=True)

if mime == "text/html":
    contents = peek + b''.join(r.iter_content(10 * 1024))
    encoding = r.encoding
    if encoding is None:
        # detect encoding
        encoding = chardet.detect(contents)['encoding']
    try:
        textcontent = str(contents, encoding, errors='replace')
    except (LookupError, TypeError):
        textcontent = str(contents, errors='replace')
    print(textcontent)

presuming you use Python 3.

The alternative is to make 2 requests:

r = requests.get("http://www.december.com/html/demo/hello.html", prefetch=False)
mime = magic.from_buffer(r.iter_content(256).next(), mime=True)

if mime == "text/html":
     print(r.requests.get("http://www.december.com/html/demo/hello.html").text)

Python 2 version:

r = requests.get("http://www.december.com/html/demo/hello.html", prefetch=False)
peek = r.iter_content(256).next()
mime = magic.from_buffer(peek, mime=True)

if mime == "text/html":
    contents = peek + ''.join(r.iter_content(10 * 1024))
    encoding = r.encoding
    if encoding is None:
        # detect encoding
        encoding = chardet.detect(contents)['encoding']
    try:
        textcontent = unicode(contents, encoding, errors='replace')
    except (LookupError, TypeError):
        textcontent = unicode(contents, errors='replace')
    print(textcontent)


回答2:

if 'content-type' suffices, you can issue an HTTP 'Head' request instead of 'Get', to just receive the HTTP headers.

import requests

url = 'http://www.december.com/html/demo/hello.html'
response = requests.head(url)
print response.headers['content-type']