Writing UTF-8-formatted Python lists to CSV

Published 2019-07-14 22:34

Question:

How do I write UTF-8 characters to a CSV file?

My data and code:

# -*- coding: utf-8 -*-

l1 = ["žžž", "ččč"]
l2 = ["žžž", "ččč"]

thelist = [l1, l2]

import csv
import codecs

with codecs.open('test', 'w', "utf-8-sig") as f:
    writer = csv.writer(f)
    for x in thelist:
        print x
        for mem in x:
            writer.writerow(mem)

The error message:

Traceback (most recent call last):
  File "2010rudeni priimti.py", line 263, in <module>
    writer.writerow(mem)
  File "C:\Python27\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 82, in encode
    return encode(input, errors)
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 15, in encode
    return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)

Press any key to continue . . .

What is my mistake?

Answer 1:

The csv module in 2.x doesn't read or write Unicode; it reads and writes bytes (and assumes they're ASCII-compatible, which isn't a problem for UTF-8).

So, when you give it a codecs Unicode file to write to, it passes in str rather than unicode. And when codecs tries to encode that str to UTF-8, it first has to decode it to Unicode, using your default encoding, which is ASCII, and that decode is what fails. Hence this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
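
You can reproduce the same failure without csv at all. In 2.x, calling encode on a byte string implicitly decodes it with the default ASCII codec first (here '\xc5\xbe' is what ž looks like as UTF-8 bytes):

>>> "\xc5\xbe".encode("utf-8")   # str.encode implies an ASCII decode first
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)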

The solution is explained in the csv docs: the Examples section includes a UnicodeWriter wrapper that takes care of everything for you. Use UnicodeWriter with a plain binary file, instead of a codecs file.
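
As a sketch, adapted and abridged from the UnicodeWriter example in the 2.7 docs (this assumes your rows hold unicode strings; see the first fix below). The queue-and-re-encode dance is so the underlying csv writer only ever sees UTF-8 bytes:

import csv
import codecs
import cStringIO

class UnicodeWriter:
    # Writes rows of unicode strings to the byte stream f, encoded
    # in the given encoding (adapted from the Python 2.7 csv docs).
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        # feed csv UTF-8 bytes, then re-encode the result into the
        # target encoding on the way out to the real file
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue().decode("utf-8")
        self.stream.write(self.encoder.encode(data))
        self.queue.truncate(0)

with open('test.csv', 'wb') as f:   # plain binary file, not codecs.open
    writer = UnicodeWriter(f, encoding="utf-8-sig")
    for x in thelist:
        writer.writerow(x)          # one row per inner list of unicode strings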


As an alternative, there are a few packages on PyPI that wrap the csv module to deal directly in unicode instead of str, such as unicodecsv.
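
For example, with unicodecsv (pip install unicodecsv) your loop needs almost no changes; roughly:

import unicodecsv

with open('test.csv', 'wb') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')
    writer.writerows(thelist)   # rows of unicode strings; encoding is handled for you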

As a more radical alternative, Python 3.x's csv module doesn't have this problem in the first place (and 3.x also doesn't have the next problem).
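
For comparison, on 3.x the whole task collapses to the obvious code, because csv deals in text and the file object does the encoding:

# Python 3
import csv

thelist = [["žžž", "ččč"], ["žžž", "ččč"]]

with open('test.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(thelist)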

A much hackier alternative is to just pretend the entire world is UTF-8. After all, both your source code and your output are intended to be UTF-8, and the csv module doesn't care about anything but a handful of characters (newlines, commas, maybe quotes and backslashes) being ASCII-compatible. So you could skip decoding and encoding entirely, and everything will work. The obvious downside is that if you get anything wrong, instead of getting an error to debug, you will get a file full of garbage.
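
If you go that route, it looks something like this. Note the plain str literals and the plain binary file, so the data stays UTF-8 bytes end to end:

# -*- coding: utf-8 -*-
import csv
import codecs

l1 = ["žžž", "ččč"]   # plain str literals: raw UTF-8 bytes, never decoded
l2 = ["žžž", "ččč"]

with open('test.csv', 'wb') as f:
    f.write(codecs.BOM_UTF8)   # write the BOM by hand, since we skip utf-8-sig
    writer = csv.writer(f)
    writer.writerows([l1, l2])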


There are two other problems with your code, neither of which UnicodeWriter or unicodecsv can magically fix (although Python 3 can fix the first).

First, you're not actually giving the csv module Unicode in the first place. The columns in your source data are plain old str literals, like "žžž". You can't encode those to UTF-8 (or rather, you can, but only via an implicit ASCII decode first, which raises the same error again). Use Unicode literals, like u"žžž", to avoid this (or, if you prefer, explicitly decode from your source encoding, but that's a bit silly).
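
That is, the data definitions should read:

# -*- coding: utf-8 -*-
l1 = [u"žžž", u"ččč"]   # u prefix: real unicode objects, not UTF-8 bytes
l2 = [u"žžž", u"ččč"]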

Second, make sure the encoding declaration (the # -*- coding: utf-8 -*- line) really is the first or second line of your file and matches what your editor actually saves. Without a declaration, non-ASCII characters in source code are illegal in Python 2.7, and the compiler rejects the file outright. Worse, if the declared encoding doesn't match the file's real encoding, say the file is saved as UTF-8 but declared as Latin-1, Python won't complain at all; you can't actually put ž in a Latin-1 text file, because Latin-1 has no such character, so your žžž would come out as Å¾Å¾Å¾ and similar mojibake.