Is there a Python library function which attempts

This question already has an answer here:

Determine the encoding of text in Python (8 answers)

I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get out a unicode string approximating the original one without throwing a UnicodeDecodeError.

So, I'm looking for a function that takes a str and optionally some hints and does its darndest to give me back a unicode. I could write one of course, but if such a function exists its author has probably thought a bit deeper about the best way to go about this.

I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".

Tags: python email character-encoding invalid-characters

4 answers

Summer. ? 凉城

#2 · 2019-01-23 14:48

You may be interested in Universal Encoding Detector.


时光不老，我们不散

#3 · 2019-01-23 14:57

+1 for the chardet module (suggested by @insin).

It is not in the standard library, but you can easily install it with the following command:

$ pip install chardet

Example:

>>> import chardet
>>> import urllib
>>> detect = lambda url: chardet.detect(urllib.urlopen(url).read())
>>> detect('http://stackoverflow.com')
{'confidence': 0.85663169917190185, 'encoding': 'ISO-8859-2'}
>>> detect('https://stackoverflow.com/questions/269060/is-there-a-python-lib')
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

See Installing Pip if you don't have pip yet.
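To go from the detection result to an actual unicode string, something like the following might work. This is only a sketch, assuming Python 3 (where the raw input is `bytes`) and that chardet is installed; `guess_decode`, its `detector` parameter and the `latin-1` fallback are my own illustrative choices, not part of chardet. It relies only on `chardet.detect` returning a dict whose `'encoding'` entry may be `None`.

```python
def guess_decode(raw, detector=None, fallback="latin-1"):
    """Decode bytes using a detector's best guess, never raising on bad bytes.

    `detector` is a hypothetical hook: any callable that returns a dict with
    an 'encoding' key, shaped like the result of chardet.detect.
    """
    if detector is None:
        import chardet  # third-party: pip install chardet
        detector = chardet.detect

    guess = detector(raw)
    # The detector may report None when it has no idea; use the fallback then.
    encoding = guess.get("encoding") or fallback
    try:
        # errors="replace" turns undecodable bytes into U+FFFD.
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The detector named a codec that this Python does not know.
        return raw.decode(fallback, errors="replace")
```

Calling `guess_decode(header_bytes)` would then use chardet's guess, and passing your own `detector` lets you plug in hints or a different library.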


戒情不戒烟

#4 · 2019-01-23 15:01

As far as I can tell, the standard library doesn't have such a function, though it's not too difficult to write one as suggested above. I think the real thing I was looking for was a way to decode a string with a guarantee that it wouldn't throw an exception. The errors parameter of str.decode does that.

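def decode(s, encodings=('ascii', 'utf8', 'latin1')):
    """Return s decoded with the first encoding that accepts it."""
    for encoding in encodings:
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Reached only if every candidate fails ('latin1' accepts any bytes);
    # errors='ignore' drops undecodable bytes instead of raising.
    return s.decode('ascii', 'ignore')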
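For what it's worth, here is a small illustration of how the errors parameter changes the outcome. It assumes Python 3, where the undecoded input is `bytes`; the sample header bytes are made up for the example.

```python
# A made-up header containing one Latin-1 byte (0xE9) in supposedly ASCII text.
raw = b"Subject: caf\xe9 menu"

# The default (errors="strict") raises on the first non-ASCII byte.
strict_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops the offending byte.
ignored = raw.decode("ascii", errors="ignore")

# errors="replace" substitutes U+FFFD, so you can see where data was lost.
replaced = raw.decode("ascii", errors="replace")
```

Which handler is preferable depends on whether you would rather lose the byte silently or keep a visible marker in the result.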


成全新的幸福

#5 · 2019-01-23 15:04

The best way to do this that I've found is to iteratively try decoding the string with each of the most common encodings inside a try/except block.
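A rough sketch of that loop, assuming Python 3 and `bytes` input. The name `decode_with_fallbacks` and the candidate list are just my picks for illustration. Order matters: `latin-1` maps every byte to a character and so never fails, which means it has to come last or nothing after it is ever tried.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in order; return (text, encoding_that_worked).

    If none of the candidates succeeds, fall back to ASCII with replacement
    characters and report None as the encoding.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes, try the next one
    return raw.decode("ascii", errors="replace"), None
```

Returning the encoding alongside the text makes it easier to log which guess was used for a given message.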

