Requests module reports a different encoding than the HTML page

Posted 2019-02-19 13:42

The requests module reports a different encoding than the one actually declared in the HTML page.

Code:

import requests
URL = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)

Output:

ISO-8859-1

Whereas the actual encoding declared in the HTML is UTF-8: content="text/html; charset=UTF-8"

My questions are:

  1. Why does requests.encoding report a different encoding than the one declared in the HTML page?

I am trying to convert the content to UTF-8 using objReq.content.decode(encodes).encode("utf-8"). Since the content is already UTF-8, decoding it as ISO-8859-1 and re-encoding as UTF-8 mangles the characters, e.g. á changes to Ã¡.

Is there any way to convert all types of encodings into UTF-8?

3 Answers
贼婆χ
#2 · 2019-02-19 13:48

Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the response headers.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.

You can test for this by looking for a charset parameter in the Content-Type header:

resp = requests.get(....)
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
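The decision that check encodes can be factored into a small helper to make the logic explicit. This is a sketch for illustration; `effective_encoding` is a hypothetical name, not part of the Requests API:

```python
def effective_encoding(content_type, declared_encoding):
    """Trust the header-declared encoding only when the Content-Type
    header actually carries a charset parameter."""
    if 'charset' in (content_type or '').lower():
        return declared_encoding
    return None  # no charset: fall back to the document's own declaration

# A bare text/html header: Requests would default to ISO-8859-1,
# but here we treat the encoding as unknown instead.
print(effective_encoding('text/html', 'ISO-8859-1'))            # None
print(effective_encoding('text/html; charset=UTF-8', 'UTF-8'))  # UTF-8
```

Returning None for the bare text/html case is what lets a later step (such as BeautifulSoup below) consult the document's own <meta> declaration instead.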

Your HTML document specifies the content type in a <meta> tag, and since the HTTP headers carry no charset, it is this declaration that is authoritative:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

HTML 5 also defines a <meta charset="..." /> tag, see <meta charset="utf-8"> vs <meta http-equiv="Content-Type">

You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at the very least correct that header in that case.

Using BeautifulSoup:

from bs4 import BeautifulSoup

# pass in explicit encoding if set as a header
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
soup = BeautifulSoup(content, 'html.parser', from_encoding=encoding)
if soup.original_encoding != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode to UTF-8 (prettify() returns a str, so encode explicitly)
    content = soup.prettify().encode('utf-8')

Similarly, other document standards may specify their own encodings; XML, for example, is assumed to be UTF-8 unless a <?xml encoding="..." ... ?> XML declaration, again part of the document itself, says otherwise.
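To illustrate the XML case with the standard library (a minimal sketch, using a Latin-1 document constructed inline):

```python
import xml.etree.ElementTree as ET

# The declaration says ISO-8859-1; the parser honours it when decoding the bytes.
xml_bytes = '<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\xe9</root>'.encode('iso-8859-1')
root = ET.fromstring(xml_bytes)
print(root.text)  # café
```

Because the declaration travels with the document, the parser needs no external hint, unlike the HTML case above where the HTTP header and the <meta> tag can disagree.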

爱情/是我丢掉的垃圾
#3 · 2019-02-19 14:00

Requests relies on the HTTP Content-Type response header (and exposes a chardet-based guess as response.apparent_encoding). For the common case of text/html with no charset, it assumes a default of ISO-8859-1. The issue is that Requests doesn't know anything about HTML meta tags, which can specify a different text encoding, e.g. <meta charset="utf-8"> or <meta http-equiv="content-type" content="text/html; charset=UTF-8">.

A good solution is to use BeautifulSoup's "Unicode, Dammit" feature, like this:

from bs4 import UnicodeDammit
import requests


url = 'http://www.reynamining.com/nuevositio/contacto.html'
r = requests.get(url)

dammit = UnicodeDammit(r.content)
r.encoding = dammit.original_encoding

print(r.text)
干净又极端
#4 · 2019-02-19 14:06

Requests will first check for an encoding in the HTTP header:

print(obj.headers['content-type'])

output:

text/html

Since the header carries no charset parameter, Requests cannot determine the encoding and falls back to the default, ISO-8859-1.
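When the header gives no charset, Requests still exposes a bytes-based guess as response.apparent_encoding. A tiny stand-in for that idea (a simplified sketch that mirrors the spirit, not the implementation, of chardet):

```python
def guess_encoding(raw: bytes) -> str:
    """Try UTF-8 first; Latin-1 never fails to decode,
    so use it as the last resort."""
    try:
        raw.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'iso-8859-1'

print(guess_encoding('café'.encode('utf-8')))       # utf-8
print(guess_encoding('café'.encode('iso-8859-1')))  # iso-8859-1
```

Setting obj.encoding = obj.apparent_encoding before reading obj.text is the usual remedy in Requests itself.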

See more in the docs.
