Encoding issue of a character in utf-8

Posted 2019-05-10 05:16

I extract a link from a web page with the Beautiful Soup library via a.get('href'). The link contains the character ®, but in the value I get back it has become Â®. How can I decode it properly? I have already added # -*- coding: utf-8 -*- at the top of my script.

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.text)

Tags: python utf-8 beautifulsoup python-requests mojibake

1 answer

兄弟一词,经得起流年.

#2 · 2019-05-10 06:10

Do not use r.text; leave decoding to BeautifulSoup:

soup = BeautifulSoup(r.content)

r.content gives you the response as bytes, without decoding. r.text, on the other hand, is the response decoded to Unicode.

What happened is that the server did not include a character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1: text/* responses are by default expected to use the ISO-8859-1 (Latin-1) character set.

For your HTML page that default is wrong, so you got incorrect results: r.text decoded the UTF-8 bytes as Latin-1, producing mojibake:
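The fallback rule can be sketched with the standard library alone. This is a simplified illustration of the rule described above, not the code requests actually runs; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type, default="ISO-8859-1"):
    """Return the declared charset, or the Latin-1 default for text/* types.

    Hypothetical helper that mimics the RFC 2616 section 3.7.1 rule.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased charset parameter, or None
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return default  # no charset declared on a text/* response
    return None  # non-text type with no declared charset


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

With a declared charset the helper returns it unchanged; a bare text/html header falls through to the Latin-1 default, which is the situation that produced the wrong decoding here.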

>>> print u'®'.encode('utf8').decode('latin1')
Â®
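The same round trip can be written for Python 3, where it also shows that the damage is reversible as long as the garbled text has not been altered. A minimal sketch; the variable names are illustrative only.

```python
# Python 3: reproduce the mojibake, then reverse it.
original = "®"
raw = original.encode("utf-8")       # b'\xc2\xae', the bytes the server sent
garbled = raw.decode("latin-1")      # wrong codec: each byte becomes one character
repaired = garbled.encode("latin-1").decode("utf-8")  # undo the wrong decode

print(garbled)   # Â®
print(repaired)  # ®
```

Repairing after the fact is a workaround; decoding the bytes with the right codec in the first place avoids the problem.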

An HTML page can declare its own encoding in a <meta> tag in the document head. BeautifulSoup will use that declaration and decode the bytes for you.

Even if the <meta> header tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
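To see this without a network request, the sketch below feeds BeautifulSoup UTF-8 bytes directly. The HTML document and the example.com URL are made up for illustration, and the built-in "html.parser" backend is an assumption; the original answer does not name a parser.

```python
from bs4 import BeautifulSoup

# Made-up page: UTF-8 bytes with a <meta charset> declaration,
# standing in for what r.content would return.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="http://example.com/brand®">link</a></body></html>'
).encode("utf-8")

# Pass bytes, not decoded text, so BeautifulSoup picks the encoding itself.
soup = BeautifulSoup(html_bytes, "html.parser")
href = soup.a.get("href")

print(soup.original_encoding)  # the encoding BeautifulSoup settled on
print(href)
```

The href comes back with ® intact because the bytes were never decoded as Latin-1 along the way.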
