General Unicode/UTF-8 support for csv files in Pyt-第2页回答

The csv module in Python doesn't work properly when there's UTF-8/Unicode involved. I have found, in the Python documentation and on other webpages, snippets that work for specific cases but you have to understand well what encoding you are handling and use the appropriate snippet.

How can I read and write both strings and Unicode strings from .csv files that "just works" in Python 2.6? Or is this a limitation of Python 2.6 that has no simple solution?

标签： python csv unicode utf-8 python-2.x

10条回答

你好瞎i

2楼-- · 2019-01-03 14:50

Here is an slightly improved version of Maxim's answer, which can also skip the UTF-8 BOM:

import csv
import codecs

class UnicodeCsvReader(object):
    def __init__(self, csv_file, encoding='utf-8', **kwargs):
        if encoding == 'utf-8-sig':
            # convert from utf-8-sig (= UTF8 with BOM) to plain utf-8 (without BOM):
            self.csv_file = codecs.EncodedFile(csv_file, 'utf-8', 'utf-8-sig')
            encoding = 'utf-8'
        else:
            self.csv_file = csv_file
        self.csv_reader = csv.reader(self.csv_file, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next() 
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, csv_file, encoding='utf-8', fieldnames=None, **kwds):
        reader = UnicodeCsvReader(csv_file, encoding=encoding, **kwds)
        csv.DictReader.__init__(self, reader.csv_file, fieldnames=fieldnames, **kwds)
        self.reader = reader

Note that the presence of the BOM is not automatically detected. You must signal it is there by passing the encoding='utf-8-sig' argument to the constructor of UnicodeCsvReader or UnicodeDictReader. Encoding utf-8-sig is utf-8 with a BOM.

0人赞添加讨论(0) 举报

等我变得足够好

3楼-- · 2019-01-03 14:55

The example code of how to read Unicode given at http://docs.python.org/library/csv.html#examples looks to be obsolete, as it doesn't work with Python 2.6 and 2.7.

Here follows UnicodeDictReader which works with utf-8 and may be with other encodings, but I only tested it on utf-8 inputs.

The idea in short is to decode Unicode only after a csv row has been split into fields by csv.reader.

class UnicodeCsvReader(object):
    def __init__(self, f, encoding="utf-8", **kwargs):
        self.csv_reader = csv.reader(f, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next() 
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding="utf-8", fieldnames=None, **kwds):
        csv.DictReader.__init__(self, f, fieldnames=fieldnames, **kwds)
        self.reader = UnicodeCsvReader(f, encoding=encoding, **kwds)

Usage (source file encoding is utf-8):

csv_lines = (
    "абв,123",
    "где,456",
)

for row in UnicodeCsvReader(csv_lines):
    for col in row:
        print(type(col), col)

Output:

$ python test.py
<type 'unicode'> абв
<type 'unicode'> 123
<type 'unicode'> где
<type 'unicode'> 456

0人赞添加讨论(0) 举报

成全新的幸福

4楼-- · 2019-01-03 14:58

The module provided here, looks like a cool, simple, drop-in replacement for the csv module that allows you to work with utf-8 csv.

import ucsv as csv
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row

0人赞添加讨论(0) 举报

Root（大扎）

5楼-- · 2019-01-03 15:03

I would add to itsadok's answer. By default, excel saves csv files as latin-1 (which ucsv does not support). You can easily fix this by:

with codecs.open(csv_path, 'rb', 'latin-1') as f:
    f = StringIO.StringIO( f.read().encode('utf-8') )

reader = ucsv.UnicodeReader(f)
# etc.

0人赞添加讨论(0) 举报

上一页 1 2

General Unicode/UTF-8 support for csv files in Pyt

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间