Have I got that all the right way round? Anyway, I am parsing a lot of HTML, but I don't always know what encoding it's meant to be in (and a surprising number of pages lie about it). The code below shows what I've been doing so far, but I'm sure there's a better way. Your suggestions would be much appreciated.
import logging
import codecs
from utils.error import Error
class UnicodingError(Error):
    pass
# these encodings should be in most likely order to save time
encodings = [ "ascii", "utf_8", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500", "cp737", "cp775", "cp850", "cp852", "cp855",
"cp856", "cp857", "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875", "cp932", "cp949",
"cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258",
"euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2",
"iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5",
"iso8859_6", "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "johab", "koi8_r", "koi8_u",
"mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004",
"shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8_sig" ]
def to_unicode(string):
    '''Decode a byte string by guessing its encoding, trying each candidate in turn.'''
    for enc in encodings:
        try:
            logging.debug("unicoder is trying %s encoding", enc)
            decoded = unicode(string, enc)
            logging.info("unicoder is using %s encoding", enc)
            return decoded
        except UnicodeDecodeError:
            # this encoding didn't fit; move on to the next candidate
            continue
    raise UnicodingError("still don't recognise encoding after trying to guess.")
There are two general purpose libraries for detecting unknown encodings: chardet, which is supposed to be a port of the way that Firefox does it, and UnicodeDammit from Beautiful Soup (see the answer below).
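A minimal sketch of how chardet is typically used (this assumes the chardet package is installed; the file name and variable names are just illustrative):

import chardet

raw = open("page.html", "rb").read()   # raw bytes, encoding unknown
guess = chardet.detect(raw)            # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
if guess["encoding"]:
    text = raw.decode(guess["encoding"])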
You can use the following regex to detect utf8 from byte strings:
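One widely used candidate is the W3C pattern for well-formed UTF-8; whether this is exactly the regex intended here is an assumption, but in Python it looks like this:

import re

# matches only byte sequences that are structurally valid UTF-8
utf8_pattern = re.compile(br"""^(?:
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$""", re.VERBOSE)

def looks_like_utf8(data):
    return utf8_pattern.match(data) is not None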
In practice if you're dealing with English I've found the following works 99.9% of the time:
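A common version of that trick (an assumption about the specifics, not a quote) is to try UTF-8 first and fall back to latin-1, which maps every possible byte value and therefore never fails to decode:

def best_effort_decode(data):
    '''Decode as UTF-8 if possible; otherwise fall back to latin-1, which never raises.'''
    try:
        return data.decode("utf_8")
    except UnicodeDecodeError:
        return data.decode("latin_1")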
Since you are using Python, you might try UnicodeDammit. It is part of Beautiful Soup, which you may also find useful. As the name suggests, UnicodeDammit will try to do whatever it takes to get proper unicode out of the crap you may find in the world.
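A sketch of what that looks like with the bs4 package (the import location differs in older Beautiful Soup releases, and the file name is illustrative):

from bs4 import UnicodeDammit

raw = open("page.html", "rb").read()
dammit = UnicodeDammit(raw)
print(dammit.original_encoding)    # the encoding UnicodeDammit settled on
text = dammit.unicode_markup       # the decoded document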
I've tackled the same problem and found that there's no way to determine a content's encoding without metadata about the content, which is why I ended up with the same approach you're trying here.
My only additional advice is that, rather than ordering the list of possible encodings in most-likely order, you should order it by specificity. I've found that certain character sets are subsets of others, so if you check utf_8 as your second choice, you'll never find the subsets of utf_8 (I think one of the Korean character sets uses the same number space as UTF).
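To illustrate why the ordering matters (illustrative bytes, using only the built-in codecs): a permissive single-byte codec such as latin_1 will happily decode bytes that are really UTF-8, so if it sits too early in the list it shadows every stricter codec that comes after it:

data = b"\xed\x95\x9c\xea\xb8\x80"   # UTF-8 for the Korean word "hangul"

print(data.decode("utf_8"))          # correct text
print(data.decode("latin_1"))        # also "succeeds", but produces mojibake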