[Update] Appreciate the answers and input all around, but working code would be most welcome. If you can supply code that can read the sample files you are king (or queen).
[Update 2] Thanks for the excellent answers and discussion. What I need to do with these is to read them in, parse them, and save parts of them in Django model instances. I believe that means converting them from their native encoding to unicode so Django can deal with them, right?
There are several questions on Stackoverflow already on the subject of non-ascii python CSV reading, but the solutions shown there and in the python documentation don't work with the input files I'm trying.
The gist of the solution seems to be to encode('utf-8') the input to the CSV reader and unicode(item, 'utf-8') the output of the reader. However, this runs into UnicodeDecodeError issues (see above questions):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected
The input file is not necessarily in utf8; it can be ISO-8859-1, cp1251, or just about anything else.
So, the question: what's a resilient, cross-encoding capable way to read CSV files in Python?
The root of the issue seems to be that the CSV module is a C extension; is there a pure-python CSV reading module?
If not, is there a way to confidently detect the encoding of the input file so that it can be processed?
Basically I'm looking for a bullet proof way to read (and hopefully write) CSV files in any encoding.
Here are two sample files: European, Russian.
And here's the recommended solution failing:
Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
...     # csv.py doesn't do Unicode; encode temporarily as UTF-8:
...     csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
...                             dialect=dialect, **kwargs)
...     for row in csv_reader:
...         # decode UTF-8 back to Unicode, cell by cell:
...         yield [unicode(cell, 'utf-8') for cell in row]
...
>>> def utf_8_encoder(unicode_csv_data):
...     for line in unicode_csv_data:
...         yield line.encode('utf-8')
...
>>> r = unicode_csv_reader(file('sample-euro.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in unicode_csv_reader
File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 14: ordinal not in range(128)
>>> r = unicode_csv_reader(file('sample-russian.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in unicode_csv_reader
File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 28: ordinal not in range(128)
You are attempting to apply a solution to a different problem. Note this:
def utf_8_encoder(unicode_csv_data)
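You are feeding it str objects. In Python 2, calling .encode('utf-8') on a str first decodes it implicitly as ASCII, which is why the traceback reports the 'ascii' codec.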
The problem with reading your non-ASCII CSV files is that you don't know the encoding and you don't know the delimiter. If you do know the delimiter and the encoding, and the encoding is ASCII-based (e.g. cp125x, any East Asian encoding, or UTF-8, but not UTF-16 or UTF-32), this will work:
# csv.reader needs an open file (binary mode in Python 2), not a filename
with open("foo.csv", "rb") as f:
    for row in csv.reader(f, delimiter=known_delimiter):
        row = [item.decode(encoding) for item in row]
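For readers on Python 3, where the csv module works on text rather than bytes, the same idea could be sketched roughly as follows. The function name read_csv_rows and the sample data are hypothetical, made up for illustration; the encoding and delimiter are still assumed to be known in advance.

```python
import csv
import io

def read_csv_rows(binary_stream, encoding, delimiter):
    """Yield each CSV row as a list of str, decoded with the given encoding."""
    # newline="" leaves line-ending handling to the csv module.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    for row in csv.reader(text_stream, delimiter=delimiter):
        yield row

# Hypothetical bytes standing in for a cp1251 file with semicolon delimiters.
sample = "дата;сумма\r\n04.02.2011;300,00\r\n".encode("cp1251")
rows = list(read_csv_rows(io.BytesIO(sample), "cp1251", ";"))
```

With a real file you would pass open(pathname, "rb") instead of the BytesIO object.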
Your sample-euro.csv looks like cp1252 with a comma delimiter. The Russian one looks like cp1251 with a semicolon delimiter. By the way, it seems from the contents that you will also need to determine what date format is being used and maybe the currency also -- the Russian example has money amounts followed by a space and the Cyrillic abbreviation for "roubles".
Note carefully: Resist all attempts to persuade you that you have files encoded in ISO-8859-1. They are encoded in cp1252.
Update in response to comment """If I understand what you're saying I must know the encoding in order for this to work? In the general case I won't know the encoding and based on the other answer guessing the encoding is very difficult, so I'm out of luck?"""
You must know the encoding for ANY file-reading exercise to work.
Guessing the encoding correctly all the time for any encoding in any size file is not very difficult -- it's impossible. However, if you restrict the scope to CSV files of a reasonable size, saved out of Excel or OpenOffice in the user's locale's default encoding, it's not such a big task. I'd suggest giving chardet a try; it guesses windows-1252 for your euro file and windows-1251 for your Russian file -- a fantastic achievement given their tiny size.
Update 2 in response to """working code would be most welcome"""
Working code (Python 2.x):
from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

# Exercise for the reader: replace the above with a class

import csv
import sys
from pprint import pprint

pathname = sys.argv[1]
delim = sys.argv[2] # allegedly known
print "delim=%r pathname=%r" % (delim, pathname)

with open(pathname, 'rb') as f:
    cd_result = charset_detect(f)
    encoding = cd_result['encoding']
    confidence = cd_result['confidence']
    print "chardet: encoding=%s confidence=%.3f" % (encoding, confidence)
    # insert actions contingent on encoding and confidence here
    f.seek(0)
    csv_reader = csv.reader(f, delimiter=delim)
    for bytes_row in csv_reader:
        unicode_row = [x.decode(encoding) for x in bytes_row]
        pprint(unicode_row)
Output 1:
delim=',' pathname='sample-euro.csv'
chardet: encoding=windows-1252 confidence=0.500
[u'31-01-11',
u'Overf\xf8rsel utland',
u'UTLBET; ID 9710032001647082',
u'1990.00',
u'']
[u'31-01-11',
u'Overf\xf8ring',
u'OVERF\xd8RING MELLOM EGNE KONTI',
u'5750.00',
u';']
Output 2:
delim=';' pathname='sample-russian.csv'
chardet: encoding=windows-1251 confidence=0.602
[u'-',
u'04.02.2011 23:20',
u'300,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041c\u0422\u0421',
u'']
[u'-',
u'04.02.2011 23:15',
u'450,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041e\u043f\u043b\u0430\u0442\u0430 Interzet',
u'']
[u'-',
u'13.01.2011 02:05',
u'100,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041c\u0422\u0421 kolombina',
u'']
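The "actions contingent on encoding and confidence" placeholder in the script is left open. One possible way to fill it, sketched here in Python 3 with the standard library only, is to trust the detector's guess above some threshold and otherwise walk a list of fallback encodings. The function name decode_with_fallback, the 0.5 threshold and the fallback list are all assumptions for illustration, not part of chardet.

```python
def decode_with_fallback(raw, detected, confidence,
                         threshold=0.5, fallbacks=("utf-8", "cp1252")):
    """Return (text, encoding_used) for the bytes in raw."""
    candidates = []
    # Only trust the detector's guess when it is reasonably confident.
    if detected and confidence >= threshold:
        candidates.append(detected)
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for these bytes, or a codec name Python doesn't know.
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"

# Hypothetical inputs: one with no usable guess, one with a confident guess.
euro_text, euro_enc = decode_with_fallback("Overføring".encode("cp1252"), None, 0.0)
rus_text, rus_enc = decode_with_fallback("руб.".encode("cp1251"), "windows-1251", 0.9)
```

Bear in mind that a wrong single-byte encoding usually decodes without raising anything, so a successful decode here means "no error", not "correct text".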
Update 3 What is the source of these files? If they are being "saved as CSV" from Excel or OpenOffice Calc or Gnumeric, you could avoid the whole encoding drama by having them saved as "Excel 97-2003 Workbook (*.xls)" and use xlrd to read them. This would also save the hassles of having to inspect each csv file to determine the delimiter (comma vs semicolon), date format (31-01-11 vs 04.02.2011), and "decimal point" (5750.00 vs 450,00) -- all those differences presumably being created by saving as CSV. [Dis]claimer: I'm the author of xlrd.
I don't know if you've already tried this, but the examples section of the official Python documentation for the csv module has a pair of classes, UnicodeReader and UnicodeWriter. They have worked fine for me so far.
Correctly detecting the encoding of a file seems to be a very hard problem. You can read the discussion in this StackOverflow thread.
Your code does the wrong thing by calling .encode('utf-8'); you should be decoding instead. And by the way, unicode(bytestr, 'utf-8') == bytestr.decode('utf-8').
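To make the decode direction concrete, here is a small Python 3 sketch, where str(raw, encoding) plays the role of Python 2's unicode(bytestr, encoding). The sample bytes are made up; 0xa3 is the byte from the error message in the question, and cp1252 is only assumed to be the right encoding for it.

```python
raw = b"\xa3100"  # hypothetical bytes from a cp1252 file

# The two spellings of "decode" are equivalent.
same = str(raw, "cp1252") == raw.decode("cp1252")

# Decoding with the wrong codec is what raises UnicodeDecodeError.
error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

text = raw.decode("cp1252")        # bytes -> text, using the file's encoding
utf8_bytes = text.encode("utf-8")  # text -> bytes, only when writing back out
```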
But most importantly, WHY are you trying to decode the strings?
It sounds a bit absurd, but you can actually work with those CSV files without caring whether they are cp1251, cp1252 or UTF-8. The beauty of it is that the regional characters are all above 0x7F, and UTF-8 likewise uses sequences of bytes above 0x7F to represent non-ASCII symbols.
Since the separators CSV cares about (be it , or ; or \n) are within ASCII, its work won't be affected by the encoding used (as long as it is one-byte or utf-8!).
An important thing to note is that you should give the Python 2.x csv module files opened in binary mode, that is either 'rb' or 'wb', because of the peculiar way it was implemented.
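In Python 3 the csv module no longer accepts bytes, but the same encoding-agnostic parsing can be approximated by decoding as latin-1, which maps each byte to the code point of the same value, and turning the cells back into bytes afterwards. This is a sketch of that substitute trick, not what the Python 2 module does internally; parse_without_decoding and the sample line are hypothetical.

```python
import csv
import io

def parse_without_decoding(raw, delimiter):
    """Split raw CSV bytes into rows of byte-string cells, encoding unknown."""
    # latin-1 is used purely as a lossless byte <-> code point mapping.
    text = raw.decode("latin-1")
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return [[cell.encode("latin-1") for cell in row] for row in reader]

# The same line encoded two ways parses into the same cells, because the
# delimiter and line ending are plain ASCII in both encodings.
line = "04.02.2011;300,00 руб.;МТС\r\n"
results = {}
for enc in ("cp1251", "utf-8"):
    cells = parse_without_decoding(line.encode(enc), ";")[0]
    results[enc] = [cell.decode(enc) for cell in cells]
```

As with the original point, this only holds for single-byte ASCII-based encodings and UTF-8, not for UTF-16 or UTF-32.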
What you are asking is impossible. There is no way to write a program in any language that will accept input in an unknown encoding and correctly convert it to Unicode internal representation.
You have to find a way to tell the application which encoding to use.
It is possible to recognize many, but not all, encodings with chardet, but it really depends on what the content of the files is and whether there are enough data points. This is similar to the issue of correctly decoding filenames on network servers. When a file is created on a network server, there is no way to tell the server what encoding is used, so if you have a folder with names in multiple encodings, some of them are guaranteed to look odd to some, if not all, users, and different users will see different files as odd.
However, don't give up. Try the chardet encoding detector mentioned in this question: https://serverfault.com/questions/82821/how-to-tell-the-language-encoding-of-a-filename-on-linux
and if you are lucky, you won't get many failures.