How to normalize unicode encoding for iso-8859-15

2019-05-12 22:44发布

问题:

I want to convert unicode string into iso-8859-15. These strings include the u"\u2019" (RIGHT SINGLE QUOTATION MARK see http://www.fileformat.info/info/unicode/char/2019/index.htm) character which is not part of the iso-8859-15 characters set.

In Python, how to normalize the unicode characters in order to match the iso-8859-15 encoding?

I have looked at the unicodedata module without success. I manage to do the job with

s.replace(u"\u2019", "'").encode('iso-8859-15')

but I would like to find a more general and cleaner way.

Thanks for your help

回答1:

Use the unicode version of the translate function, assuming s is a unicode string:

s.translate({ord(u"\u2019"):ord(u"'")})

The argument of the unicode version of translate is a dict mapping unicode ordinals to unicode ordinals. Add to this dict other characters you cannot encode in your target encoding.

You can build your mapping table in a little more readable form and create your mapping dict from it, for instance:

char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]
translate_mapping = {ord(k):ord(v) for k,v in char_mappings}

From translate documentation:

For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).



回答2:

Unless you wish to create a translation rule (if you do, look at Boud's answer), you could choose one of the default error handlers encode provides or even register your own one:

In [4]: u'\u2019 Hi'.encode('iso-8859-15', 'replace')
Out[4]: '? Hi'

In [5]: u'\u2019 Hi'.encode('iso-8859-15', 'ignore')
Out[5]: ' Hi'

In [6]: u'\u2019 Hi'.encode('iso-8859-15', 'xmlcharrefreplace')
Out[6]: '’ Hi'

From encode docstring:

S.encode([encoding[,errors]]) -> string or unicode

Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.



回答3:

For info, my final solution:

iso885915_utf_map = {
    u"\u2019":  u"'",
    u"\u2018":  u"'",
    u"\u201c":  u'"',
    u"\u201d":  u'"',
}
utf_map = dict([(ord(k), ord(v)) for k,v in iso885915_utf_map.items()])
s.translate(utf_map).encode('iso-8859-15')

Thank you for your help