I get a link from a web page using the Beautiful Soup library, via a.get('href'). The link contains a special character, ®, but in the value I get back it has become Â®. How can I decode it properly? I have already added # -*- coding: utf-8 -*- at the top of the script.
r = requests.get(url)
soup = BeautifulSoup(r.text)
Do not use r.text; leave decoding to BeautifulSoup by passing it r.content instead. r.content gives you the response in bytes, without decoding. r.text, on the other hand, is the response decoded to unicode.

What happens is that the server did not include the character set in the response headers. In that case, requests follows the HTTP RFC 2616, section 3.7.1: text/ responses are by default expected to use the ISO-8859-1 (Latin-1) character set.

For your HTML page, that default is wrong, and you got incorrect results; r.text decoded the bytes as Latin-1, resulting in Mojibake (the UTF-8 bytes of ® read as Latin-1 come out as Â®).

HTML can itself declare the correct encoding, in the form of a <meta> tag in the HTML header. BeautifulSoup will use that tag and decode the bytes for you.

Even if the <meta> tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
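
To make the difference concrete, here is a minimal offline sketch rather than a definitive recipe. It assumes Python 3 with the bs4 package installed. The variable raw_bytes is a hypothetical stand-in for r.content, and the example.com URL and page markup are made up for illustration. It decodes the same bytes once as Latin-1 (what r.text effectively does when no charset header is sent) and once by handing the raw bytes to BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content: the UTF-8 encoded bytes of a page
# that declares its own encoding in a <meta> tag.
html_text = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="http://example.com/brand\u00ae">link</a></body></html>'
)
raw_bytes = html_text.encode("utf-8")

# Decoding the bytes as Latin-1 first, as r.text would without a charset
# header, turns the two UTF-8 bytes of the (R) sign into two characters.
wrong_text = raw_bytes.decode("latin-1")
wrong_href = BeautifulSoup(wrong_text, "html.parser").a.get("href")

# Passing the undecoded bytes lets BeautifulSoup pick the encoding itself,
# here from the <meta charset> declaration.
soup = BeautifulSoup(raw_bytes, "html.parser")
right_href = soup.a.get("href")

print(repr(wrong_href))        # href with the Mojibake sequence
print(repr(right_href))        # href with the intended character
print(soup.original_encoding)  # encoding BeautifulSoup settled on
```

With a real response the same idea would be soup = BeautifulSoup(r.content, "html.parser"); naming the parser explicitly is a choice made here, not something the original snippet did.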