I'm trying to write out a csv file with Unicode characters, so I'm using the unicodecsv package. Unfortunately, I'm still getting UnicodeDecodeErrors:
# -*- coding: utf-8 -*-
import codecs
import unicodecsv
raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
encoded_contents = unicode(raw_contents, errors='replace')
with codecs.open('test.csv', 'w', 'UTF-8') as f:
w = unicodecsv.writer(f, encoding='UTF-8')
w.writerow(["1", encoded_contents])
This is the traceback:
Traceback (most recent call last):
File "unicode_test.py", line 11, in <module>
w.writerow(["1", encoded_contents])
File "/Library/Python/2.7/site-packages/unicodecsv/__init__.py", line 83, in writerow
self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 17: ordinal not in range(128)
I thought converting it to Unicode would be good enough, but that doesn't seem to be the case. I'd really like to understand what is happening so that I'm better prepared to handle these errors in other projects in the future.
From the traceback, it looks like I can reproduce the error like this:
>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.encode('UTF-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>>
Up until now, I thought I had a decent working knowledge of working with Unicode text in Python 2.x, but this has humbled me.