I have some data with non-ASCII characters. I attempted to take care of it using the following:
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import sys
import re
reload(sys)
sys.setdefaultencoding('latin1')
However, I have identified some records that still give me encoding/decoding problems. I have copied and pasted one of the problematic records (the name and location columns) below:
'Eugène Badeau' 'E, Québec (county/comté), Quebec, Canada'
Appending .decode('utf-8') to the exact extracted text resolved the problem:
print 'Eugène Badeau E, Québec (county/comté), Quebec, Canada'.decode('utf-8')
output: Eugène Badeau E, Québec (county/comté), Quebec, Canada
So I tried to use it to convert my pandas columns:
df.name = df.name.str.encode('utf-8')
The location seems to be OK, but the name is still wrong:
print df.location[735]
print df.name[735]
output:
E, Québec (county/comté), Quebec, Canada
eugã¨ne badeau
You could use apply combined with the unidecode library, which transliterates accented characters down to plain ASCII.
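A minimal sketch of that apply-based approach. Since unidecode is a third-party package, this sketch substitutes the standard library's unicodedata for the same accent-stripping effect (swap in unidecode.unidecode for more thorough transliteration); the sample DataFrame contents are taken from the question:

```python
import unicodedata
import pandas as pd

def strip_accents(text):
    # Decompose accented characters (NFKD), then drop the combining
    # marks by encoding to ASCII with errors ignored.
    # With unidecode installed you could instead use:
    #   from unidecode import unidecode; return unidecode(text)
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))

# Sample data from the question
df = pd.DataFrame({
    'name': [u'Eug\u00e8ne Badeau'],
    'location': [u'E, Qu\u00e9bec (county/comt\u00e9), Quebec, Canada'],
})

# Apply the accent-stripping function element-wise to the column
df['name'] = df['name'].apply(strip_accents)
print(df['name'][0])  # Eugene Badeau
```

Note this loses the accents rather than preserving them; if you want the original characters to print correctly, the underlying fix is to decode the bytes (e.g. .str.decode('utf-8') in Python 2) rather than encode them.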