I'm trying to use this function:
import unicodedata
def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
in the code below (which unzips and reads files with non-ASCII strings). But I'm getting this error, (from this library file C:\Python27\Lib\encodings\utf_8.py
):
Message File Name Line Position
Traceback
<module> C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py 64
getNameList C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py 26
remove_accents C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py 17
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 3: ordinal not in range(128)
Why am I getting this error? How to avoid it and make remove_accents
work?
Thanks for any help!
Here's the entire code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
import os
import re
from zipfile import ZipFile
import csv
##def strip_accents(s):
## return ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
import unicodedata
def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
def getNameList():
namesDict=extractNamesDict()
maleNames=list()
femaleNames=list()
for name in namesDict:
print name
# name = strip_accents(name)
name = remove_accents(name)
counts=namesDict[name]
tuple=(name,counts[0],counts[1])
if counts[0]>counts[1]:
maleNames.append(tuple)
elif counts[1]>counts[0]:
femaleNames.append(tuple)
names=(maleNames,femaleNames)
# print maleNames
return names
def extractNamesDict():
zf=ZipFile('names.zip', 'r')
filenames=zf.namelist()
names=dict()
genderMap={'M':0,'F':1}
for filename in filenames:
file=zf.open(filename,'r')
rows=csv.reader(file, delimiter=',')
for row in rows:
#name=row[0].upper().decode('latin1')
name=row[0].upper()
gender=genderMap[row[1]]
count=int(row[2])
if not names.has_key(name):
names[name]=[0,0]
names[name][gender]=names[name][gender]+count
file.close()
# print '\tImported %s'%filename
# print names
return names
if __name__ == "__main__":
getNameList()
Best practice is to decode to Unicode when the data comes into your program:
Then
remove_accents
can just be:Encode data when leaving your program such as writing to a file, database, terminal, etc.
Why remove accents in the first place?
You get it because you are decoding from a bytestring without specifying a codec:
Add a codec there (here I assume your data is encoded in utf-8,
0xe1
would be the first of a 3-byte character):If you want to robustly convert unicode characters to ascii in a string, you should use the awesome unidecode module: