When the server sends the header 'Content-Type: text/html', requests.get() returns improperly encoded data.
However, if the content type is declared explicitly as 'Content-Type: text/html; charset=utf-8', it returns properly encoded data.
Also, when we use urllib.urlopen(), it returns properly encoded data.
Has anyone noticed this before? Why does requests.get() behave like this?
From the requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check which encoding requests used for your page, and if it is not the right one, force it to the one you need.
As for the difference between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.
The "educated guesses" mentioned above are probably just a check of the Content-Type header sent by the server (a rather misleading use of "educated", imho).
For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (even though the default for HTML5 is UTF-8).
For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
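To make the header-only behaviour described above concrete, here is a minimal sketch of that kind of guess. The function name guess_encoding_from_content_type is hypothetical and this is not the actual requests source, just an approximation of the rule "use the charset parameter if present, otherwise fall back to ISO-8859-1 for text types".

```python
def guess_encoding_from_content_type(content_type):
    """Hypothetical helper approximating a header-only encoding guess."""
    if not content_type:
        return None
    # Split "text/html; charset=utf-8" into the media type and its parameters.
    media_type, *params = [part.strip() for part in content_type.split(";")]
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip("'\" ")
    # No charset parameter: fall back to the historical default for text/* types.
    if media_type.lower().startswith("text/"):
        return "ISO-8859-1"
    return None


print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("text/html; charset=utf-8"))
```

Note that the body of the response is never inspected here, which is why a UTF-8 page served without a charset parameter ends up decoded as Latin-1.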
Luckily for us, requests uses the chardet library (newer releases may use charset_normalizer instead), and that usually works quite well. The detected value is exposed as the attribute requests.Response.apparent_encoding, so you usually want to do:
import requests

r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text
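A slightly more cautious variant, as a sketch: only override the encoding when the server did not declare a charset itself, since content detection can occasionally guess wrong. The helper name prefer_declared_charset is made up for illustration; it only relies on the headers, encoding and apparent_encoding attributes of a response object.

```python
def prefer_declared_charset(response):
    """Hypothetical helper: trust an explicit charset, otherwise use detection."""
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower():
        # The server gave no charset, so the header-based guess is only a default.
        # Replace it with the encoding detected from the body itself.
        response.encoding = response.apparent_encoding
    return response
```

Usage would be something like r = prefer_declared_charset(requests.get(url)), followed by r.text as before.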
The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC 2854. UTF-8 was too young to become the default: it was born in 1993, about the same time as HTML and HTTP.
Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not care about the correct encoding, the value of .text may be off.
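To see what "off" looks like in practice, here is a small standalone illustration that needs no network access. The variable raw stands in for what .content would hold, and the sample string is made up for the demo.

```python
# UTF-8 bytes as a server might send them (this plays the role of .content).
raw = "café".encode("utf-8")

# Decoding with the wrong guess (Latin-1) produces mojibake.
wrong = raw.decode("ISO-8859-1")

# Decoding with the real encoding gives the intended text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))

# Because Latin-1 maps every byte to a character, the damage is reversible.
assert wrong.encode("ISO-8859-1").decode("utf-8") == right
```

The bytes in .content are always intact; only the decoding step behind .text depends on the guessed encoding.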
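After getting the response, take response.content instead of response.text. That gives you the raw bytes exactly as the server sent them (UTF-8 in this case), without requests decoding them using a guessed encoding.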
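import requests

# download_link, myUsername and myPassword are placeholders defined elsewhere
response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)
if response.status_code == 200:  # compare by value; 'is' checks object identity
    body = response.content  # raw bytes, not decoded by requests
else:
    print("Unable to get response with Code : %d " % response.status_code)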