How to fix broken utf-8 encoding in Python?

Posted 2019-03-28 05:21

My string is Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I have seen that this site can do it: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

I started by trying this in Python:

mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')

but that is not correct: the original string is UTF-8, yet the string shown is not the result I expect.

Note: the text is Vietnamese.

How can I resolve this? Is it a Windows encoding or something else? How can I detect the encoding here?

Tags: python unicode utf-8 character-encoding

2 Answers

虎瘦雄心在

#2 · 2019-03-28 06:01

I'm not sure what you can do with this kind of data in general, but for the example in your original post, this works in Python 2:

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
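
In Python 3, str has no decode method, so the same round trip has to start from encode. The following is a minimal sketch, assuming the damage came from UTF-8 bytes being read as Latin-1 or Windows-1252; fix_mojibake and its wrong_codecs parameter are hypothetical names used for illustration, not part of any library.

```python
def fix_mojibake(text, wrong_codecs=("latin-1", "cp1252")):
    """Try to undo UTF-8 bytes that were mistakenly decoded with a single-byte codec.

    fix_mojibake is a hypothetical helper, not a standard-library function.
    """
    for codec in wrong_codecs:
        try:
            # Recover the raw bytes, then decode them as the UTF-8 they really were.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the damage; try the next one
    return text  # nothing matched; leave the input unchanged


# Reproduce the damage from the question: UTF-8 bytes read as Latin-1.
original = "09. Bát Nhã Tâm Kinh"
broken = original.encode("utf-8").decode("latin-1")

fixed = fix_mojibake(broken)
print(fixed == original)  # prints True
```

Trying Latin-1 first and Windows-1252 second is only a guess at the most likely culprits. Text that was damaged in another way, or that lost bytes along the way, is returned unchanged.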


萌系小妹纸

#3 · 2019-03-28 06:06

The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes pretty much everything and works much better than online decoders.

>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'

It can be installed with pip install ftfy.
