
Question:

There are a number of threads on this problem around the web, but I cannot seem to find the answer for my specific case.

I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is a full Traceback:

Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)

With the help of Stack Overflow, I got it open with:

reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')

Now the problem is that when I am reading the file:

def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row

I get stuck on Asian symbols:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)

I have tried decode(), encode(), and unicode() on the string containing that character, but it does not seem to help.

for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row

At this point I do not really understand why. If I run:

>>> print u'\u5a07'
娇

>>> print '娇'
娇

it works. It also works in the terminal. I have checked the default encoding in both the terminal and the Python shell, and it is UTF-8 everywhere, so that symbol prints without trouble. I assume the problem has something to do with opening the file with codecs using UTF-16.

I am not sure where to go from here. Could anyone help out?

Answer 1:

The Python 2 csv module cannot handle Unicode input. It says so explicitly on its documentation page:

Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;

You need to convert your CSV file to UTF-8 so that the module can deal with it:

import codecs

# Decode the UTF-16 source line by line and write it back out as UTF-8 bytes.
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))

Alternatively, you can use the command-line utility iconv to convert the file for you.
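If iconv is not at hand, the same conversion can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name `convert_utf16_to_utf8` and the `chunk_size` parameter are made up for illustration, and it uses `io.open` instead of `codecs.open` so that it should run unchanged on Python 2.6+ and Python 3.

```python
import io


def convert_utf16_to_utf8(src_path, dst_path, chunk_size=64 * 1024):
    """Re-encode a UTF-16 text file as UTF-8 (hypothetical helper)."""
    # newline='' turns off newline translation, so the line endings in the
    # source file are copied through exactly as they are.
    with io.open(src_path, 'r', encoding='utf-16', newline='') as infile:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as outfile:
            while True:
                # Read decoded text in chunks rather than loading the
                # whole file into memory.
                chunk = infile.read(chunk_size)
                if not chunk:
                    break
                outfile.write(chunk)
```

The 'utf-16' codec uses the byte order mark at the start of the file to pick the byte order and drops it while decoding, so the UTF-8 output does not start with a BOM.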

Then use that re-coded file to read your data:

import csv

# The csv module now sees plain UTF-8 bytes; decode each column afterwards.
reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    print [c.decode('utf8') for c in row]

Note that the columns then need decoding to unicode manually.

Answer 2:

Encode errors are what you get when you try to convert unicode characters to 8-bit sequences. So your first error is not an error you get when actually reading the file, but a bit later.
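To make the distinction concrete, here is a minimal sketch using the character from the traceback. The variable names are only for illustration. Encoding to ASCII fails because the character is outside the 128-character ASCII range, while encoding to UTF-8 succeeds:

```python
text = u'\u5a07'  # the character reported in the traceback

# UTF-8 can represent any Unicode code point, so this succeeds.
utf8_bytes = text.encode('utf-8')

# ASCII only covers code points 0-127, so this raises UnicodeEncodeError.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

In Python 2, the same ASCII encoding step can also happen implicitly whenever a unicode string is passed to code that expects a byte string, which would explain why the error shows up without any explicit call to encode().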

You probably get this error because the Python 2 csv module expects files to be opened in binary mode, while you opened yours in a way that returns unicode strings.

Change your opening to this:

reader = csv.reader(open(file_full_path, 'rb'), delimiter='\t', quotechar='"')

And you should be fine. Or even better:

with open(file_full_path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    # CSV handling here.

However, you can't feed it UTF-16 (or UTF-32) directly: in UTF-16 the separator characters are two bytes each, and the csv module will not handle this correctly, so you will need to convert the file to UTF-8 first.
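A small sketch of why byte-level parsing of UTF-16 goes wrong, assuming little-endian UTF-16 without a BOM to keep the bytes easy to read. Splitting on the single tab byte leaves stray NUL bytes attached to the fields:

```python
# A tab is two bytes in UTF-16: the tab byte plus a NUL byte.
tab_utf16 = u'\t'.encode('utf-16-le')

# Encode a tiny two-column line, then split it the way a byte-oriented
# parser would, on the single tab byte.
line = u'a\tb'.encode('utf-16-le')
fields = line.split(b'\t')

# The second half of the separator (the NUL byte) ends up glued to the
# start of the next field, so neither field is valid text on its own.
```

After converting to UTF-8 the tab is a single byte again, which is what a byte-oriented parser expects.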