I have some text that uses Unicode punctuation, such as the left double quote and the right single quote used as an apostrophe, and I need it in ASCII. Does Python have a database of these characters with obvious ASCII substitutes, so I can do better than turning them all into "?"?
In my original answer, I also suggested unicodedata.normalize. However, I decided to test it out, and it turns out it doesn't work with Unicode quotation marks. It does a good job translating accented Unicode characters, so I'm guessing unicodedata.normalize is implemented using the unicodedata.decomposition function, which leads me to believe it probably can only handle Unicode characters that are combinations of a letter and a diacritical mark. I'm not really an expert on the Unicode specification, though, so I could just be full of hot air...
In any event, you can use unicode.translate (str.translate in Python 3) to deal with punctuation characters instead. The translate method takes a dictionary of Unicode ordinals to Unicode ordinals, so you can create a mapping that translates Unicode-only punctuation to ASCII-compatible punctuation, for example the curly double quotes (U+201C and U+201D) to the ASCII double quote (0x22).
You can add more punctuation mappings if needed, but I don't think you necessarily need to worry about handling every single Unicode punctuation character. If you do need to handle accents and other diacritical marks, you can still use unicodedata.normalize to deal with those characters.
Interesting question.
Google helped me find this page, which describes using the unicodedata module for this kind of conversion.
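A minimal sketch of that kind of unicodedata usage, assuming Python 3. The helper name strip_to_ascii is made up for illustration and is not taken from the linked page. It decomposes the text with NFKD and then drops anything that is still not ASCII, which handles accented letters but simply loses curly quotes:

```python
import unicodedata


def strip_to_ascii(text):
    """Decompose characters with NFKD, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # "ignore" silently discards every code point that has no ASCII encoding,
    # including the combining accents produced by the decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_to_ascii("caf\u00e9 na\u00efve"))  # accents are stripped: cafe naive
print(strip_to_ascii("\u201cquoted\u201d"))    # curly quotes are dropped: quoted
```

The second call shows the limitation mentioned earlier in the thread: the curly quotes have no decomposition, so they vanish instead of becoming plain quotes.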
Unidecode looks like a complete solution. It converts fancy quotes to ASCII quotes and accented Latin characters to unaccented ones, and it even attempts transliteration for characters that don't have ASCII equivalents. That way your users don't have to see a bunch of "?" characters when you have to pass their text through a legacy 7-bit ASCII system.
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/
There's additional discussion about this at http://code.activestate.com/recipes/251871/ which has the NFKD solution and some ways of doing a conversion table, for things like ± => +/- and other non-letter characters.
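As a rough illustration of combining a conversion table with the NFKD step, here is a sketch assuming Python 3, where str.translate accepts multi-character replacement strings. The table contents and the names PUNCTUATION and to_ascii are hypothetical choices for this example, not the recipe's own code, and the table is deliberately small:

```python
import unicodedata

# Hypothetical conversion table: code point -> ASCII replacement string.
PUNCTUATION = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark (also used as apostrophe)
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "--",   # em dash
    0x2026: "...",  # horizontal ellipsis
    0x00B1: "+/-",  # plus-minus sign
}


def to_ascii(text):
    """Apply the table, strip accents via NFKD, then replace leftovers with '?'."""
    text = text.translate(PUNCTUATION)
    text = unicodedata.normalize("NFKD", text)
    # Remove the combining marks that the decomposition leaves behind.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Anything still outside ASCII falls back to "?".
    return text.encode("ascii", "replace").decode("ascii")


print(to_ascii("\u201cIt\u2019s caf\u00e9 \u00b1 1\u201d"))  # "It's cafe +/- 1"
```

Only characters missing from the table and without a useful decomposition end up as "?", so the table can grow as new cases turn up in real input.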