UnicodeDecodeError when using Python 2.x unicodecsv

Posted 2020-04-03 06:14

I'm trying to write out a csv file with Unicode characters, so I'm using the unicodecsv package. Unfortunately, I'm still getting UnicodeDecodeErrors:

# -*- coding: utf-8 -*-

import codecs
import unicodecsv

raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
encoded_contents = unicode(raw_contents, errors='replace')

with codecs.open('test.csv', 'w', 'UTF-8') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])

This is the traceback:

Traceback (most recent call last):
  File "unicode_test.py", line 11, in <module>
    w.writerow(["1", encoded_contents])
  File "/Library/Python/2.7/site-packages/unicodecsv/__init__.py", line 83, in writerow
    self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 17: ordinal not in range(128)

I thought converting it to Unicode would be good enough, but that doesn't seem to be the case. I'd really like to understand what is happening so that I'm better prepared to handle these errors in other projects in the future.

From the traceback, it looks like I can reproduce the error like this:

>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>> 

Up until now, I thought I had a decent working knowledge of handling Unicode text in Python 2.x, but this has humbled me.

1 Answer

我只想做你的唯一 · 2020-04-03 06:35

You should not use codecs.open() for your file. unicodecsv wraps the csv module, which always writes byte strings to the open file object. In order to write those byte strings to a Unicode-aware file object such as the one returned by codecs.open(), they are first implicitly decoded, using the default ASCII codec; that implicit decode is where your UnicodeDecodeError exception comes from.
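
To see the mechanism in isolation, here is a minimal sketch without the csv layer (the file name demo.txt is purely illustrative, and it assumes a UTF-8 terminal as in your reproduction): writing a plain byte string to a codecs.open() file object triggers the same implicit ASCII decode.

>>> import codecs
>>> f = codecs.open('demo.txt', 'w', 'UTF-8')
>>> f.write('He observes an “Oversized Gorilla” near Ashford')  # a byte string, not unicode
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)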

Use a file in binary mode instead:

with open('test.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])

The binary mode is not strictly necessary unless your data contains embedded newlines, but the csv module wants to control how newlines are written to ensure that such values are handled correctly. However, not using codecs.open() is an absolute requirement.
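
As a quick sanity check (not part of the fix itself, just a sketch with illustrative variable names), you can read the file back the same way, in binary mode, and let unicodecsv handle the decoding:

with open('test.csv', 'rb') as f:
    r = unicodecsv.reader(f, encoding='UTF-8')
    for row in r:
        print row[1]  # each cell comes back as a unicode object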

The same implicit decode happens when you call .encode() on a byte string: you already have encoded data there, so Python 2 first implicitly decodes it (again with ASCII) to produce a Unicode value that it can then encode.
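
In other words, raw_contents.encode('UTF-8') behaves roughly like raw_contents.decode('ascii').encode('UTF-8'). The way to get a real unicode value from a byte string is to decode it with the codec it was actually encoded in; a sketch in the interactive interpreter (assuming a UTF-8 terminal, as in your reproduction above, and an illustrative variable name):

>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> unicode_contents = raw_contents.decode('UTF-8')   # explicit decode, no ASCII guessing
>>> unicode_contents
u'He observes an \u201cOversized Gorilla\u201d near Ashford'
>>> unicode_contents.encode('UTF-8') == raw_contents  # round-trips to the same bytes
True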
