How do I write UTF-8 characters to a CSV file?
My data and code:
# -*- coding: utf-8 -*-
l1 = ["žžž", "ččč"]
l2 = ["žžž", "ččč"]
thelist = [l1, l2]
import csv
import codecs
with codecs.open('test', 'w', "utf-8-sig") as f:
    writer = csv.writer(f)
    for x in thelist:
        print x
        for mem in x:
            writer.writerow(mem)
The error message:
Traceback (most recent call last):
File "2010rudeni priimti.py", line 263, in <module>
writer.writerow(mem)
File "C:\Python27\lib\codecs.py", line 691, in write
return self.writer.write(data)
File "C:\Python27\lib\codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
File "C:\Python27\lib\encodings\utf_8_sig.py", line 82, in encode
return encode(input, errors)
File "C:\Python27\lib\encodings\utf_8_sig.py", line 15, in encode
return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 11: ordinal not in range(128)
Press any key to continue . . .
What is my mistake?
The csv module in 2.x doesn't read/write Unicode; it reads/writes bytes (and assumes they're ASCII-compatible, but that's not a problem with UTF-8). So, when you give it a codecs Unicode file to write to, it passes a str rather than a unicode. And when codecs tries to encode that str to UTF-8, it has to first decode it to Unicode, for which it uses your default encoding, which is ASCII, which fails. Hence the UnicodeDecodeError in your traceback.

The solution is explained in the docs, with a UnicodeWriter wrapper in the Examples section that takes care of everything for you. Use that UnicodeWriter with a plain binary file, instead of using a codecs file.

As an alternative, there are a few different packages on PyPI that wrap up the csv module to deal directly in unicode instead of str, like unicodecsv.

As a more radical alternative, Python 3.x's csv module doesn't have this problem in the first place (and 3.x also doesn't have the next problem).

A much hackier alternative is to just pretend the entire world is UTF-8. After all, both your source code and your output are intended to be UTF-8, and the csv module doesn't care about anything but a handful of characters (newlines, commas, maybe quotes and backslashes) being ASCII-compatible. So you could just skip decoding and encoding entirely, and everything will work. The obvious downside is that if you get anything wrong, instead of getting an error to debug, you will get a file full of garbage.

There are two other problems with your code, neither of which UnicodeWriter or unicodecsv can magically fix (although Python 3 can fix the first).

First, you're not actually giving the csv module Unicode in the first place. The columns in your source data are plain old str literals, like "žžž". You can't encode those to UTF-8 (or rather, you can, but only by automatically decoding them as ASCII first, which will just cause the same error again). Use Unicode literals, like u"žžž", to avoid this (or, if you prefer, explicitly decode from your source encoding… but that's kind of silly).

Second, you haven't specified an encoding declaration in your source, but you've used non-ASCII characters. Technically, this is illegal in Python 2.7. Practically, I'm pretty sure it gives you a warning but then treats your source as Latin-1. Which is bad, because you're clearly not using a Latin-1 editor (you can't put ž in a Latin-1 text file, because there is no such character). If you're saving the file as UTF-8 and then telling Python to interpret it as Latin-1, you're going to end up with Å¾Å¾Å¾ instead of žžž, and similar mojibake.
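To make the Python 3 route concrete: in 3.x, the csv module works on text (Unicode) natively, so you can hand it a file opened with whatever encoding you want and write your rows directly. A minimal sketch, not your exact code (the test.csv filename is just for illustration, and newline='' is what the 3.x csv docs recommend when opening csv files):

```python
import csv

rows = [[u"žžž", u"ččč"], [u"žžž", u"ččč"]]

# Open a text-mode file with an explicit encoding; utf-8-sig also
# writes the BOM, like codecs.open(..., "utf-8-sig") did in 2.x.
with open('test.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        # Pass the whole row (a list of strings); csv splits it into columns.
        writer.writerow(row)

# Read it back to check the round trip.
with open('test.csv', 'r', encoding='utf-8-sig', newline='') as f:
    print(list(csv.reader(f)))  # [['žžž', 'ččč'], ['žžž', 'ččč']]
```

Note that writerow takes one sequence per row; in 3.x there is no encode/decode step for you to get wrong, which is exactly why the original error can't happen here.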