Encoding/decoding non-ASCII character when using P

Posted 2019-09-01 18:40

I have some data with non-ASCII characters. I attempted to take care of it using the following:

# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import sys
import re
reload(sys)
sys.setdefaultencoding('latin1')

However, some records still give me encoding/decoding problems. Here is one of the problematic records (its name and location columns), copied and pasted:

'EugÃ¨ne Badeau'    'E, QuÃ©bec (county/comtÃ©), Quebec, Canada'

Appending .decode('utf-8') to this exact text resolved the problem:

print 'EugÃ¨ne Badeau   E, QuÃ©bec (county/comtÃ©), Quebec, Canada'.decode('utf-8')
output: Eugène Badeau   E, Québec (county/comté), Quebec, Canada
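
The garbling shown above is what UTF-8 bytes look like when each byte is read as a separate Latin-1 character. A minimal sketch of that mechanism follows. It is written for Python 3 (not the Python 2.7 used in the question), and the variable names are chosen only for illustration:

```python
# Python 3 sketch: how "è" turns into "Ã¨" and back again.
original = "Eugène Badeau"

# In UTF-8, "è" is stored as the two bytes 0xC3 0xA8.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Latin-1 maps each byte to its own character:
# 0xC3 becomes "Ã" and 0xA8 becomes "¨".
garbled = utf8_bytes.decode("latin-1")    # 'EugÃ¨ne Badeau'

# Reversing the wrong step recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")    # 'Eugène Badeau'
```

The round trip only works while the garbled characters still correspond to the original bytes; any later change to them (for example, lowercasing "Ã" to "ã") breaks it.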

So I tried to use it to convert my pandas column:

df.name = df.name.str.encode('utf-8')

The location seems to be ok, but the name is still wrong:

print df.location[735]
print df.name[735]

output:
E, Québec (county/comté), Quebec, Canada
eugã¨ne badeau

Tags: python-2.7 pandas character-encoding ascii non-ascii-characters

1 Answer

我欲成王，谁敢阻挡

#2 · 2019-09-01 19:35

You could use apply together with the unidecode library, which transliterates accented characters to plain ASCII:

from unidecode import unidecode

df['name'] = df['name'].apply(lambda x: unidecode(unicode(x, encoding="utf-8")))
df['location'] = df['location'].apply(lambda x: unidecode(unicode(x, encoding="utf-8")))

;)
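
If the accents should be kept rather than transliterated to ASCII, a different approach is to reverse the mis-decoding directly. The following is only a sketch under stated assumptions: it targets Python 3, it assumes the values are text strings that were UTF-8 bytes wrongly read as Latin-1, and `fix_mojibake` is a hypothetical helper name, not part of pandas or unidecode:

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up (hypothetical helper).

    Returns the input unchanged when the round trip is not possible,
    for example when the text was already correct.
    """
    try:
        # Turn each character back into its original byte, then decode properly.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


fixed_location = fix_mojibake("E, QuÃ©bec (county/comtÃ©), Quebec, Canada")
untouched = fix_mojibake("Québec")          # already correct, left as-is
unrecoverable = fix_mojibake("eugã¨ne badeau")  # round trip fails, left as-is
```

With pandas, such a helper could be applied per column in the same way as above, e.g. `df['name'].apply(fix_mojibake)`. Note that the lowercased value `eugã¨ne badeau` from the question falls into the fallback branch: lowercasing "Ã" to "ã" changes the underlying byte, so the text no longer decodes as UTF-8. That may be why the name column could not be repaired, although the question does not confirm that the column was lowercased.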
