Decoding unknown encoded Traditional Chinese chara

Posted 2019-02-28 03:45

Hi, I have a website in Traditional Chinese, and when I check the site statistics it tells me that the search term for the website is å%8f°å%8d%97 è¦ªå%90é¤%90å»³, which obviously makes no sense to me. My question is: what is this encoding called? And is there a way to use Python to decode this character string? Thank you.

Tags: python text-manipulation

2 answers

淡お忘

#2 · 2019-02-28 04:32

It is called a mutt encoding (the usual name for this kind of garbling is mojibake); the underlying bytes have been mangled beyond their original meaning and no longer form a real encoding.

The text was once URL-quoted UTF-8, but it was then decoded as Latin-1 without first unquoting the URL escapes. I was able to un-mangle it by reversing those steps:

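>>> # Python 2 session; \xad is the invisible soft hyphen (U+00AD) that sits before the first %90
>>> from urllib2 import unquote
>>> bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
>>> unquoted = unquote(bytesquoted)
>>> print unquoted.decode('utf8')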
台南 親子餐廳
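
The session above is Python 2. On Python 3 the same three steps (back to Latin-1 bytes, resolve the %XX escapes, decode as UTF-8) might look roughly like the sketch below. The names `MANGLED` and `unmangle` are made up for illustration, and the string is written with explicit escapes on the assumption that the original contains a soft hyphen (U+00AD) that is easily lost when copying, since the UTF-8 bytes of 子 are E5 AD 90.

```python
from urllib.parse import unquote_to_bytes

# The search term from the question, spelled with escapes so that the
# invisible soft hyphen (\u00ad) survives copy and paste.
MANGLED = (
    "\u00e5%8f\u00b0\u00e5%8d%97 "
    "\u00e8\u00a6\u00aa\u00e5\u00ad%90\u00e9\u00a4%90\u00e5\u00bb\u00b3"
)


def unmangle(text: str) -> str:
    """Undo 'URL-quoted UTF-8 that was read as Latin-1' (hypothetical helper)."""
    # Step 1: map each Latin-1 character back to the single byte it came from.
    raw = text.encode("latin-1")
    # Step 2: turn the leftover %XX escapes into raw bytes as well.
    raw = unquote_to_bytes(raw)
    # Step 3: the byte string should now be valid UTF-8.
    return raw.decode("utf-8")


if __name__ == "__main__":
    print(unmangle(MANGLED))  # 台南 親子餐廳
```

If the input contains characters outside Latin-1, or the bytes do not form valid UTF-8, the helper raises `UnicodeEncodeError` or `UnicodeDecodeError`, which is a reasonable signal that the text was mangled in some other way.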


做自己的国王

#3 · 2019-02-28 04:32

You can use chardet. Install the library with:

pip install chardet
# or for python3
pip3 install chardet

The library includes a CLI utility, chardetect (or chardetect3, accordingly), that takes the path to a file.

Once you know the encoding, you can use something like this in Python:

import codecs
f = codecs.open('myfile.txt', 'r', 'GB2312')  # f.read() now returns decoded text

or from shell:

iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
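
For completeness, the iconv command above can also be approximated in Python 3 with the standard library alone. This is only a sketch: `reencode_to_utf8` is a hypothetical helper, the file names are the placeholders used above, and the source encoding is assumed to be known already (for example from chardetect).

```python
from pathlib import Path


def reencode_to_utf8(src, dst, src_encoding="gb2312"):
    """Read src using src_encoding and write the same text to dst as UTF-8."""
    # Decoding fails with UnicodeDecodeError if src_encoding is the wrong guess.
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text


# Example call (assumes myfile.txt exists and really is GB2312):
# reencode_to_utf8("myfile.txt", "decoded.txt", "gb2312")
```

GB2312 is just the example encoding from above; pass whatever encoding was actually detected for your file.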
