There are a number of topics on this problem around the web, but I cannot seem to find the answer for my specific case.
I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is the full traceback:
Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
With the help of Stack Overflow, I got it open with:
reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
Now the problem is that when I am reading the file:
def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row
I get stuck on Asian symbols:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
I have tried decode, encode, and unicode() on the string containing that character, but it does not seem to help.
for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row
At this point I do not really understand why. If I run
>>> print u'\u5a07'
娇
or
>>> print '娇'
娇
it works. It also works in the terminal. I have checked the default encoding in the terminal and in the Python shell, and it is UTF-8 everywhere, so that symbol prints without trouble. I assume it has something to do with me opening the file with codecs using UTF-16.
I am not sure where to go from here. Could anyone help out?
The Python 2 csv module cannot handle Unicode input. Its documentation page says so specifically. You need to convert your CSV file to UTF-8 so that the module can deal with it.
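One possible way to do that conversion in Python is sketched below. It is not necessarily the code the answer originally showed. The helper name recode_to_utf8 and the default source encoding of UTF-16 are assumptions for illustration; the 0xff byte at position 0 in the question is consistent with a UTF-16 byte order mark, which the "utf-16" codec reads and strips.

```python
import io
import shutil


def recode_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Copy a text file, re-encoding it from src_encoding to UTF-8.

    Hypothetical helper; the name and default encoding are assumptions.
    """
    # newline="" turns off newline translation, so line endings pass through unchanged.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            # Copy in chunks so a large file is not held in memory all at once.
            shutil.copyfileobj(src, dst)
```

It could be called as recode_to_utf8(file_full_path, file_full_path + '.utf8'), where the '.utf8' suffix is just an illustrative choice for the output name.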
Alternatively, you can use the command-line utility iconv to convert the file for you.
Then use that re-coded file to read your data. Note that the columns then need to be decoded to unicode manually.
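Reading the re-coded file and decoding the columns by hand might look roughly like the sketch below. The function name read_utf8_tsv is hypothetical, and the tab delimiter and quote character are taken from the question. The Python 3 branch is an addition for readers on a newer interpreter, where csv works on text and no manual decoding is needed.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2


def read_utf8_tsv(path, delimiter="\t", quotechar='"'):
    """Yield each row of a UTF-8 encoded file as a list of unicode strings."""
    if PY2:
        # Python 2: csv wants a file opened in binary mode and returns byte strings,
        # so every cell is decoded from UTF-8 by hand.
        with open(path, "rb") as handle:
            for row in csv.reader(handle, delimiter=delimiter, quotechar=quotechar):
                yield [cell.decode("utf-8") for cell in row]
    else:
        # Python 3: csv works on text, so decoding happens when the file is opened.
        with io.open(path, "r", encoding="utf-8", newline="") as handle:
            for row in csv.reader(handle, delimiter=delimiter, quotechar=quotechar):
                yield row
```

Printing a whole row in Python 2 shows the repr of each cell (escapes such as u'\u5a07'); printing an individual cell shows the character itself.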
Encode errors are what you get when you try to convert unicode characters to 8-bit sequences. So your first error is not an error you get when actually reading the file, but a bit later.
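A small illustration of that point, using the character from the error message: encoding it explicitly to UTF-8 works, while encoding it to ASCII fails with the same exception type. In Python 2, an implicit conversion from unicode to a byte string uses the ASCII codec, which is presumably what happens to the unicode lines handed to the csv reader.

```python
text = u"\u5a07"  # the character named in the error message

# UTF-8 can represent it, so an explicit encode succeeds.
utf8_bytes = text.encode("utf-8")

# ASCII only covers code points below 128, so this raises UnicodeEncodeError.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```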
You probably get this error because the Python 2 csv module expects files to be opened in binary mode, while you opened yours so that it returns unicode strings.
Change your opening to this:
And you should be fine. Or even better:
However, you can't feed it UTF-16 (or UTF-32) directly, as the separator characters are two-byte characters in UTF-16 and the csv module will not handle this correctly, so you will need to convert the data to UTF-8 first.
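If writing a converted copy to disk is not wanted, the conversion to UTF-8 can also be done on the fly. The sketch below is one way to do that, not the code from this answer. The name iter_utf16_rows is hypothetical, and the delimiter and quote character are taken from the question. On Python 2 it hands the csv module UTF-8 byte strings and decodes the cells afterwards; the Python 3 branch is an addition for newer interpreters, where csv accepts decoded text directly.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2


def iter_utf16_rows(path, delimiter="\t", quotechar='"'):
    """Yield rows of a UTF-16 file as lists of unicode strings, without a temporary file."""
    # The "utf-16" codec uses the byte order mark to pick the byte order and strips it.
    with io.open(path, "r", encoding="utf-16", newline="") as handle:
        if PY2:
            # Re-encode each decoded line to UTF-8 bytes for csv,
            # then decode every cell it returns.
            utf8_lines = (line.encode("utf-8") for line in handle)
            for row in csv.reader(utf8_lines, delimiter=delimiter, quotechar=quotechar):
                yield [cell.decode("utf-8") for cell in row]
        else:
            # Python 3: csv reads the decoded text as is.
            for row in csv.reader(handle, delimiter=delimiter, quotechar=quotechar):
                yield row
```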