The requests module reports a different encoding than the one actually set in the HTML page
Code:
import requests
URL = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)
Output:
ISO-8859-1
Whereas the actual encoding set in the HTML is UTF-8:
content="text/html; charset=UTF-8"
My questions are:
- Why is requests.encoding showing a different encoding than the one declared in the HTML page?
I am trying to convert the content to UTF-8 using objReq.content.decode(encodes).encode("utf-8").
Since the page is already in UTF-8, decoding it with ISO-8859-1 and then encoding it with UTF-8 changes the values, e.g. á changes to Ã.
Is there any way to convert all types of encodings into UTF-8?
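The character change described above can be reproduced without any network access. This is a minimal sketch, assuming the page bytes really are UTF-8 as the meta tag claims; the variable names are illustrative only.

```python
# Assumption: the server sent UTF-8 bytes, as the page's <meta> tag declares.
original = "á"
utf8_bytes = original.encode("utf-8")         # two bytes: 0xC3 0xA1

# Decoding UTF-8 bytes as ISO-8859-1 turns each byte into its own character.
misread = utf8_bytes.decode("iso-8859-1")     # 'Ã' followed by '¡'

# Re-encoding the misread text as UTF-8 bakes the damage in (four bytes now).
double_encoded = misread.encode("utf-8")

# Decoding with the codec the bytes were actually written in is lossless.
correct = utf8_bytes.decode("utf-8")

print(repr(misread), double_encoded, repr(correct))
```

Each byte of a multi-byte UTF-8 sequence becomes a separate Latin-1 character, which is why a single á shows up as Ã plus a second symbol.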
Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the Content-Type response header. See the Encoding section of the Advanced documentation:
Bold emphasis mine.
You can test for this by looking for a charset parameter in the Content-Type header.
Your HTML document specifies the content type in a <meta> header, and it is this header that is authoritative. HTML 5 also defines a <meta charset="..." /> tag; see <meta charset="utf-8"> vs <meta http-equiv="Content-Type">.
You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at the very least correct that header in that case.
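The header test mentioned above (looking for a charset parameter in Content-Type) could be sketched with the standard library alone. The helper name charset_from_content_type is hypothetical, and the header strings are sample values rather than output from the real site.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None when missing.
    return msg.get_content_charset()


# With requests, the value would come from response.headers.get("Content-Type", "").
with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/html")
print(with_charset, without_charset)
```

When the helper returns None for a text/* response, the ISO-8859-1 value in response.encoding is only the fallback default, not something the server declared.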
Using BeautifulSoup:
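A sketch of how this might look, assuming the bs4 package is installed. The function parse_html and the sample bytes are illustrative stand-ins; in real use the bytes would be response.content, and http_encoding would be response.encoding only when the header actually carried a charset.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector


def parse_html(content, http_encoding=None):
    """Parse raw HTML bytes, preferring the encoding declared in the document."""
    # Look for <meta charset=...> or <meta http-equiv="Content-Type" ...> in the bytes.
    html_encoding = EncodingDetector.find_declared_encoding(content, is_html=True)
    # Fall back to the HTTP-level charset only if the document declares nothing.
    encoding = html_encoding or http_encoding
    return BeautifulSoup(content, "html.parser", from_encoding=encoding)


# Stand-in for response.content: UTF-8 bytes with a <meta> declaration.
sample = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=UTF-8"></head>'
    "<body><p>á</p></body></html>"
).encode("utf-8")

soup = parse_html(sample)
print(soup.original_encoding, soup.p.get_text())
```

Passing bytes rather than response.text matters here: once Requests has decoded the body with the wrong codec, the declared encoding can no longer be applied.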
Similarly, other document standards may also specify specific encodings; XML, for example, is always UTF-8 unless specified by a <?xml encoding="..." ... ?> XML declaration, again part of the document.
Requests relies on the HTTP Content-Type response header and chardet. For the common case of text/html, it assumes a default of ISO-8859-1. The issue is that Requests doesn't know anything about HTML meta tags, which can specify a different text encoding, e.g. <meta charset="utf-8"> or <meta http-equiv="content-type" content="text/html; charset=UTF-8">.
A good solution is to use BeautifulSoup's "Unicode, Dammit" feature.
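A minimal sketch of the "Unicode, Dammit" approach, assuming bs4 is installed. The sample bytes stand in for response.content from Requests and are not taken from the real page.

```python
from bs4 import UnicodeDammit

# Stand-in for response.content: UTF-8 bytes carrying a <meta charset> tag.
raw = '<html><head><meta charset="utf-8"></head><body>á</body></html>'.encode("utf-8")

# Pass *bytes*, not response.text, so the detector sees the undecoded document.
dammit = UnicodeDammit(raw, is_html=True)

text = dammit.unicode_markup         # the document decoded to str
utf8_bytes = text.encode("utf-8")    # re-encode only if UTF-8 bytes are needed

print(dammit.original_encoding)
```

Since the result is a decoded string, "converting to UTF-8" becomes a single encode call on text, whatever the source encoding turned out to be.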
Requests will first check for an encoding in the HTTP header:
output:
The header does not state the encoding, so Requests cannot determine it from there and falls back to the default, ISO-8859-1.
See more in the docs.