Encoding detection library in Python [duplicate]

This question already has an answer here:

How to determine the encoding of text? 8 answers

This is somewhat related to my question here.

I process tons of text (mainly HTML and XML) fetched via HTTP. I'm looking for a Python library that can do smart encoding detection based on several strategies and convert text to Unicode using the best possible guess at the character encoding.

I found that chardet does auto-detection extremely well. However, auto-detecting everything is the problem: it is slow and it goes against the standards. As the chardet FAQ puts it, I don't want to screw the standards.

From the same FAQ here is the list of places where I want to look for encoding:

charset parameter in HTTP Content-type header.
<meta http-equiv="content-type"> element in the <head> of a web page for HTML documents.
encoding attribute in the XML prolog for XML documents.
Auto-detect the character encoding as a last resort.

Basically, I want to be able to look in all those places and also deal with conflicting information automatically.

Is there such a library out there, or do I need to write it myself?

Tags: python html xml http character-encoding

2 Answers

叛逆

#2 · 2019-04-10 11:44

BeautifulSoup's UnicodeDammit, which in turn uses chardet.

chardet by itself is quite useful for the general case (determining a text's encoding), but slow, as you say. UnicodeDammit adds extra features on top of chardet; in particular, it can look up the encoding explicitly specified in an XML encoding declaration.

As for the HTTP Content-type header, I think you need to read that yourself to extract the charset parameter, and then pass it to UnicodeDammit in the fromEncoding parameter.
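A minimal sketch of that step, assuming Python 3 and the bs4 package. The helper names `charset_from_content_type` and `to_unicode` are made up for illustration. In bs4, as far as I know, the hint is passed to UnicodeDammit as a list of candidate encodings (the second argument) rather than through a `fromEncoding` keyword, which belongs to the older BeautifulSoup 3 constructor, so check the docs for your version.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None.

    Uses the standard library's header parser instead of a hand-written regex.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


def to_unicode(raw_bytes, content_type=None):
    """Hypothetical helper: decode raw_bytes, preferring the HTTP charset.

    Assumes bs4 is installed; the import is local so the header parsing
    above still works without it.
    """
    from bs4 import UnicodeDammit  # third-party, not in the standard library

    hints = []
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            hints.append(charset)
    # Candidate encodings are tried first; detection is the fallback.
    dammit = UnicodeDammit(raw_bytes, hints)
    return dammit.unicode_markup, dammit.original_encoding
```

With a response from your HTTP client of choice, you would call something like `to_unicode(body, headers.get("Content-Type"))`, where `body` and `headers` stand in for whatever your client exposes.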

As for resolving conflicts, UnicodeDammit will give precedence to explicitly-stated encoding (if the encoding doesn't generate errors). See the docs for full details.
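If you end up writing it yourself, the same precedence idea can be sketched with the standard library alone. This is not UnicodeDammit's actual implementation, just an illustration of "trust the declared encoding unless it produces errors", following the order from the chardet FAQ. All function names and the two regular expressions are assumptions of mine, and the patterns are deliberately simplistic (they only look at the first few kilobytes and ignore BOMs).

```python
import codecs
import re
from email.message import Message

# Simplistic patterns for illustration only; a real parser is more robust.
META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?\s*([\w.:-]+)""", re.I)
XML_RE = re.compile(rb"""^\s*<\?xml[^>]*encoding\s*=\s*["']([\w.:-]+)["']""", re.I)


def declared_encodings(raw, content_type=None):
    """Yield declared encodings in priority order: HTTP, <meta>, XML prolog."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            yield charset
    head = raw[:4096]  # declarations should appear near the start
    for pattern in (META_RE, XML_RE):
        match = pattern.search(head)
        if match:
            yield match.group(1).decode("ascii")


def decode_with_fallback(raw, content_type=None):
    """Return (text, encoding), using the first declaration that decodes cleanly."""
    for name in declared_encodings(raw, content_type):
        try:
            codecs.lookup(name)  # skip names Python has no codec for
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # the declaration conflicts with the bytes; try the next
    # Last resort: auto-detect with chardet if available, else assume UTF-8.
    try:
        import chardet  # third-party, optional
        guess = chardet.detect(raw)["encoding"]
    except ImportError:
        guess = None
    guess = guess or "utf-8"
    return raw.decode(guess, errors="replace"), guess
```

The slow auto-detection only runs when every declared encoding is missing, unknown, or fails to decode, which is the point of checking the cheap sources first. Note that a wrong declaration can still decode without errors (almost any byte string is valid ISO-8859-1), so this resolves only the conflicts that surface as decoding failures.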


Anthone

#3 · 2019-04-10 11:56

BeautifulSoup (the HTML parser) incorporates a class called UnicodeDammit that does just that. Have a look and see if you like it.

