How to write UTF-8 characters to a CSV file?
My data and code:
    # -*- coding: utf-8 -*-
    l1 = ["žžž", "ččč"]
    l2 = ["žžž", "ččč"]
    thelist = [l1, l2]
    import csv
    import codecs
    with codecs.open('test', 'w', "utf-8-sig") as f:
        writer = csv.writer(f)
        for x in thelist:
            print x
            for mem in x:
                writer.writerow(mem)
The error message:
    Traceback (most recent call last):
      File "2010rudeni priimti.py", line 263, in <module>
        writer.writerow(mem)
      File "C:\Python27\lib\codecs.py", line 691, in write
        return self.writer.write(data)
      File "C:\Python27\lib\codecs.py", line 351, in write
        data, consumed = self.encode(object, self.errors)
      File "C:\Python27\lib\encodings\utf_8_sig.py", line 82, in encode
        return encode(input, errors)
      File "C:\Python27\lib\encodings\utf_8_sig.py", line 15, in encode
        return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
    Press any key to continue . . .
What is my mistake?
The csv module in 2.x doesn't read/write Unicode, it reads/writes bytes (and assumes they're ASCII-compatible, but that's not a problem with UTF-8). So, when you give it a codecs Unicode file to write to, it passes a str rather than a unicode. And when codecs tries to encode that to UTF-8, it has to first decode it to Unicode, for which it uses your default encoding, which is ASCII, which fails. Hence this error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
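You can reproduce that implicit-decode step in isolation; a minimal demonstration (assuming a UTF-8 terminal, so the str literal holds UTF-8 bytes):

    >>> s = "žžž"          # a str: really the UTF-8 bytes '\xc5\xbe' * 3
    >>> s.encode("utf-8")  # encoding a str first decodes it... as ASCII
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)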
The solution is explained in the docs, with a wrapper in the Examples section that takes care of everything for you. Use the UnicodeWriter with a plain binary file, instead of using a codecs file.
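For example, usage looks roughly like this (a sketch, assuming you've pasted the UnicodeWriter recipe verbatim from the docs' Examples section into your module; the BOM line is only there to mimic your utf-8-sig intent):

    # -*- coding: utf-8 -*-
    import codecs
    # UnicodeWriter: the recipe copied from the Examples section of the
    # 2.7 csv docs; it takes rows of unicode and handles all the encoding.
    thelist = [[u"žžž", u"ččč"], [u"žžž", u"ččč"]]
    with open('test.csv', 'wb') as f:      # plain binary file, no codecs
        f.write(codecs.BOM_UTF8)           # optional, for utf-8-sig behavior
        writer = UnicodeWriter(f, encoding='utf-8')
        for row in thelist:
            writer.writerow(row)           # each row is a list of unicode strings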
As an alternative, there are a few different packages on PyPI that wrap up the csv module to deal directly in unicode instead of str, like unicodecsv.
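With unicodecsv it's roughly a drop-in replacement (a sketch; check the package's own docs for details):

    # -*- coding: utf-8 -*-
    import unicodecsv
    thelist = [[u"žžž", u"ččč"], [u"žžž", u"ččč"]]
    with open('test.csv', 'wb') as f:        # binary file, as with UnicodeWriter
        writer = unicodecsv.writer(f, encoding='utf-8')
        for row in thelist:
            writer.writerow(row)             # rows of unicode, no manual encoding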
As a more radical alternative, Python 3.x's csv module doesn't have this problem in the first place (and 3.x also doesn't have the next problem).
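In 3.x the whole thing collapses to the obvious code, because you open the file in text mode with an explicit encoding (newline='' is what the csv docs tell you to pass):

    import csv
    thelist = [["žžž", "ččč"], ["žžž", "ččč"]]   # plain literals are already unicode in 3.x
    with open('test.csv', 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        for row in thelist:
            writer.writerow(row)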
A much hackier alternative is to just pretend the entire world is UTF-8. After all, both your source code and your output are intended to be UTF-8, and the csv module doesn't care about anything but a handful of characters (newlines, commas, maybe quotes and backslashes) being ASCII-compatible. So you could just skip the decoding and encoding entirely, and everything will work. The obvious downside here is that if you get anything wrong, instead of getting an error to debug, you will get a file full of garbage.
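A sketch of that hack: since the source file is UTF-8, the plain str literals already hold UTF-8 bytes, and you hand them to csv and a binary file without ever decoding:

    # -*- coding: utf-8 -*-
    import csv
    thelist = [["žžž", "ččč"], ["žžž", "ččč"]]   # str values: raw UTF-8 bytes
    with open('test.csv', 'wb') as f:
        writer = csv.writer(f)                   # csv just shuffles the bytes through
        for row in thelist:
            writer.writerow(row)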
There are two other problems with your code, neither of which UnicodeWriter or unicodecsv can magically fix (although Python 3 can fix the first).
First, you're not actually giving the csv module Unicode in the first place. The columns in your source data are plain old str literals, like "žžž". You can't encode that to UTF-8 (or, rather, you can, but only by automatically decoding it as ASCII first, which will just cause the same error again). Use Unicode literals, like u"žžž", to avoid this (or, if you prefer, explicitly decode from your source encoding… but that's kind of silly).
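The difference is easy to see at the interactive prompt (again assuming a UTF-8 terminal):

    >>> u = u"žžž"                   # unicode literal: actual characters
    >>> u.encode("utf-8")            # fine: returns the UTF-8 bytes
    '\xc5\xbe\xc5\xbe\xc5\xbe'
    >>> "žžž".decode("utf-8") == u   # the explicit-decode route gives the same value
    True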
Second, you need an encoding declaration at the top of your source whenever you use non-ASCII characters; without one, this is illegal in Python 2.7, and you'll get a SyntaxError complaining about a non-ASCII character with no encoding declared. You do have the # -*- coding: utf-8 -*- line here, but it has to match what your editor actually saves. You're clearly not using a Latin-1 editor (you can't put ž in a Latin-1 text file, because there is no such character), so if you save the file as UTF-8 but tell Python to interpret it as Latin-1, you're going to end up with Å¾Å¾Å¾ instead of žžž, and similar mojibake.
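You can watch that mojibake happen directly (a toy demonstration, assuming a UTF-8 terminal):

    >>> u"žžž".encode("utf-8").decode("latin-1")
    u'\xc5\xbe\xc5\xbe\xc5\xbe'
    >>> print u"žžž".encode("utf-8").decode("latin-1")
    Å¾Å¾Å¾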