Python module like csv-DictReader with full utf8 support

Posted 2019-04-15 14:33

I need to import data from a CSV file in my project, and I need an object like DictReader but with full UTF-8 support. Does anyone know a module or app that provides this?

Tags: python csv utf-8
2 answers
地球回转人心会变
#2 · 2019-04-15 14:36

As the answer to this post said:

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield dict([(key, unicode(value, 'utf-8')) for key, value in row.iteritems()])

My example code is below. I'm using your CSV file (see comments).

import csv

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield dict([(key, unicode(value, 'utf-8')) for key, value in row.iteritems()])

f = open('sampleresults.csv', 'rb')  # binary mode, as the Python 2 csv module expects
a = UnicodeDictReader(f)
for i in a:
    if i['NOMBRE'] == 'GUIDO ALEJANDRO':
        print i['APELLIDO']

Output:

MUÑOZ RENGIFO

You can see that the 'Ñ' is decoded correctly.
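The code above is Python 2 (`unicode`, `iteritems`, the `print` statement). For readers on Python 3, here is a minimal sketch of the same idea, assuming the input really is UTF-8. The function name `utf8_dict_reader` and the in-memory sample bytes are made up for illustration; the sample only reuses the two names from the example above and is not the asker's file.

```python
import csv
import io

def utf8_dict_reader(binary_stream, **kwargs):
    """Yield each CSV row as a dict with str keys and str values."""
    # In Python 3 the csv module works on text, so the decoding is done
    # by the text wrapper; newline='' lets csv handle line endings itself.
    text = io.TextIOWrapper(binary_stream, encoding='utf-8', newline='')
    for row in csv.DictReader(text, **kwargs):
        yield dict(row)

# Hypothetical stand-in for open('sampleresults.csv', 'rb')
raw = "NOMBRE,APELLIDO\r\nGUIDO ALEJANDRO,MUÑOZ RENGIFO\r\n".encode('utf-8')
rows = list(utf8_dict_reader(io.BytesIO(raw)))
for row in rows:
    if row['NOMBRE'] == 'GUIDO ALEJANDRO':
        print(row['APELLIDO'])
```

With a real file, `open(path, encoding='utf-8', newline='')` passed straight to `csv.DictReader` does the same job without the wrapper.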

狗以群分
#3 · 2019-04-15 14:51

Your data is NOT encoded in UTF-8. It is (mostly) encoded in cp1252. The data appears to include Spanish names. The most prevalent non-ASCII character is '\xd1' (i.e. Latin capital letter N with tilde) -- this is the character that caused the exception.

One of the non-ASCII characters in the file is '\x8d'. It is NOT in cp1252. It appears where the letter A should appear in the name VASQUEZ. Of the others, '\x94' (curly double quote in cp1252) appears in the middle of a name. The remaining ones may also represent errors.
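The claims about these three bytes can be checked directly against Python's codecs. This is a small Python 3 sketch; the helper name `try_decode` is made up for illustration.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# For each byte mentioned above, show how it fares as UTF-8 and as cp1252.
for byte in (b'\xd1', b'\x94', b'\x8d'):
    print(repr(byte), repr(try_decode(byte, 'utf-8')), repr(try_decode(byte, 'cp1252')))
```

None of the three is valid UTF-8 on its own; '\xd1' and '\x94' decode under cp1252, while '\x8d' is unassigned there and fails.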

I suggest that you run this little code fragment to print lines with suspicious characters in them:

for lino, line in enumerate(open('sampleresults.csv')):
    if any(c in line for c in '\x8d\x94\xc1\xcf\xd3'):
        print "%d %r\n" % (lino+1, line)

and fix up the data.
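The fragment above is Python 2. A rough Python 3 equivalent has to read the file in binary mode, because the suspect characters are raw bytes that have not been decoded yet. The function name `suspicious_lines` and the temporary file contents are made up for illustration; only the byte list and the VASQUEZ case come from the text above.

```python
import os
import tempfile

SUSPECT_BYTES = b'\x8d\x94\xc1\xcf\xd3'  # the bytes listed in the fragment above

def suspicious_lines(path, suspects=SUSPECT_BYTES):
    """Yield (line_number, raw_line) for lines containing any suspect byte."""
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, start=1):
            # Iterating over bytes yields ints; `int in bytes` tests membership.
            if any(b in suspects for b in line):
                yield lineno, line

# Hypothetical sample: '\x8d' sits where the A of VASQUEZ should be.
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write(b'NOMBRE,APELLIDO\nANA,V\x8dSQUEZ\nLUIS,PEREZ\n')
found = list(suspicious_lines(tmp.name))
os.remove(tmp.name)

for lineno, line in found:
    print("%d %r" % (lineno, line))
```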

Then you need a csv DictReader with full and generalised decoding support. Full means decoding the fieldnames aka dict keys as well as the data. Generalised means no hardcoding of the encoding.

import csv

def UnicodeDictReader(str_data, encoding, **kwargs):
    csv_reader = csv.DictReader(str_data, **kwargs)
    # Decode the keys once
    keymap = dict((k, k.decode(encoding)) for k in csv_reader.fieldnames)
    for row in csv_reader:
        yield dict((keymap[k], v.decode(encoding)) for k, v in row.iteritems())

dozedata = ['\xd1,\xff', '\xd2,\xfe', '3,4']
print list(UnicodeDictReader(dozedata, 'cp1252'))

Output:

[{u'\xd1': u'\xd2', u'\xff': u'\xfe'}, {u'\xd1': u'3', u'\xff': u'4'}]
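For comparison, here is a sketch of how the same generalised reader might look in Python 3, where `csv.DictReader` expects text. Each line is decoded before the csv module sees it, so keys and values both come out as `str`. The name `decoded_dict_reader` is made up for illustration, and decoding line by line assumes an encoding in which newline bytes are unambiguous (single-byte encodings such as cp1252, or UTF-8), not UTF-16.

```python
import csv

def decoded_dict_reader(byte_lines, encoding, **kwargs):
    """Yield rows as dicts of str, decoding fieldnames and values alike."""
    # Decode lazily, one line at a time, then let csv parse the text.
    text_lines = (line.decode(encoding) for line in byte_lines)
    for row in csv.DictReader(text_lines, **kwargs):
        yield dict(row)

# Same test data as above, written as Python 3 bytes literals.
dozedata = [b'\xd1,\xff', b'\xd2,\xfe', b'3,4']
result = list(decoded_dict_reader(dozedata, 'cp1252'))
print(result)
```

A file opened with `open(path, 'rb')` can be passed in place of `dozedata`, since iterating over it also yields one bytes object per line.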

and here is what you get with your sample file (first data row only, Python 2.7.1, Windows 7):

>>> import csv
>>> from pprint import pprint as pp
>>> def UnicodeDictReader(str_data, encoding, **kwargs):
...     csv_reader = csv.DictReader(str_data, **kwargs)
...     # Decode the keys once
...     keymap = dict((k, k.decode(encoding)) for k in csv_reader.fieldnames)
...     for row in csv_reader:
...         yield dict((keymap[k], v.decode(encoding)) for k, v in row.iteritems())
...
>>> f = open('sampleresults.csv', 'rb')
>>> drdr = UnicodeDictReader(f, 'cp1252')
>>> pp(drdr.next())
{u'APELLIDO': u'=== family names redacted ===',
 u'CATEGORIA': u'ABIERTA',
 u'CEDULA': u'10000640',
 u'DELAY': u' 0:20',
 u'EDAD': u'25',
 u'EMAIL': u'mimail640',
 u'NO.': u'640',
 u'NOMBRE': u'=== given names redacted ===',
 u'POSICION CATEGORIA': u'1',
 u'POSICION CATEGORIA EN KM.5': u'11',
 u'POSICION GENERAL CHIP': u'1',
 u'POSICION GENERAL EN KM.5': u'34',
 u'POSICION GENERAL GUN': u'1',
 u'POSICION GENERO': u'1',
 u'PRIMEROS 5KM.': u'0:32:55',
 u'PROMEDIO/KM.': u' 5:44',
 u'SEGUNDOS KM.': u'0:24:05',
 u'SEX': u'M',
 u'TIEMPO CHIP': u'0:56:59',
 u'TIEMPO GUN': u'0:57:19'}
>>>