I get a link from a web page using the Beautiful Soup library, through a.get('href'). The link contains a special character, ®, but when I retrieve it, it becomes Â®. How can I decode it properly? I have already added # -*- coding: utf-8 -*- at the top of my script.
r = requests.get(url)
soup = BeautifulSoup(r.text)
Do not use r.text; leave decoding to BeautifulSoup:
soup = BeautifulSoup(r.content)
r.content gives you the response in bytes, without decoding. r.text, on the other hand, is the response decoded to unicode.
What happens is that the server did not include the character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1: text/* responses are expected to use the ISO-8859-1 (Latin-1) character set by default.
For your HTML page, that default is wrong, and you got incorrect results; r.text decoded the bytes as Latin-1, resulting in mojibake:
>>> print u'®'.encode('utf8').decode('latin1')
Â®
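The same round trip can be sketched in Python 3 syntax, where strings are Unicode by default. The variable names are illustrative only. The last step shows that this particular kind of mojibake can usually be reversed by re-encoding as Latin-1 and decoding as UTF-8, although avoiding the wrong decode in the first place is the cleaner fix:

```python
# Python 3 sketch of the Latin-1 / UTF-8 mix-up (names are illustrative).
original = "\u00ae"  # the registered sign, ®

# What the server actually sends: the UTF-8 encoding of the character.
raw = original.encode("utf-8")      # two bytes: 0xC2 0xAE

# Decoding those bytes with the wrong codec (Latin-1) maps each byte
# to its own character, which produces the mojibake.
garbled = raw.decode("latin-1")

# Reversing the wrong decode recovers the original bytes, which can
# then be decoded correctly as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(raw))
print(garbled)
print(repaired)
```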
An HTML page can itself declare the correct encoding, in the form of a <meta> tag in the HTML header. BeautifulSoup will use that tag and decode the bytes for you. Even if the <meta> tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
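As a minimal sketch of that behaviour, the snippet below feeds BeautifulSoup raw bytes instead of a downloaded page. It assumes the bs4 package and Python 3. The HTML document and the link in it are made up for illustration, and html_bytes stands in for what r.content would return:

```python
from bs4 import BeautifulSoup

# Hypothetical page, encoded as UTF-8 and declaring that in a <meta> tag.
# In real code these bytes would come from r.content.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="/products/Acme\u00ae">Acme</a></body></html>'
).encode("utf-8")

# Passing bytes lets BeautifulSoup pick the encoding itself.
soup = BeautifulSoup(html_bytes, "html.parser")

href = soup.a.get("href")

# original_encoding reports which encoding BeautifulSoup settled on.
print(soup.original_encoding)
print(href)
```

If you already know the encoding, passing from_encoding="utf-8" to BeautifulSoup skips the detection step.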