don't know wether this is trivial or not, but I'd need to convert an unicode string to ascii string, and I wouldn't like to have all those escape chars around. I mean, is it possible to have an "approximate" conversion to some quite similar ascii character?
For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be just converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all chars?
Thank you very much!
Marco
Use the Unidecode package to transliterate the string.
>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"
b = str(a.encode('utf-8').decode('ascii', 'ignore'))
should work fine.
import unicodedata
unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')
Output:
Gavin O'Connor
Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm
Try simple character replacement
str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))
PS: add # -*- coding: utf-8 -*-
to the top of your .py
file if you get error