Python open CSV file with supposedly mixed encoding

Published 2019-09-17 13:47

Question:

I'm trying to read a CSV text file (UTF-8 without BOM, according to Notepad++) using Python. However, there seems to be a problem with the encoding:

print(open(path, encoding="utf-8").read())

Codec can't decode byte 0x8f

This little character seems to be the problem: (full string: "●• อีเปียขี้บ่น ت •●"), however I'm sure there will be more.

If I try UTF-16, then there is a message:

#also tried with encode
print(open(path, encoding="utf-16").read().encode('utf-8'))

Illegal UTF-16 surrogate

Even when I try opening it with an automatic codec converter, I still get an error.

import codecs
import csv

def csv_unireader(f, encoding="utf-8"):
    # Decode the raw stream with the given encoding, re-encode as UTF-8
    # for csv.reader, then decode each cell back to a unicode string
    # (a Python 2 style pattern).
    for row in csv.reader(codecs.iterencode(codecs.iterdecode(f, encoding), "utf-8")):
        yield [e.decode("utf-8") for e in row]

What am I overlooking? The file contains Twitter texts, which certainly include a lot of different characters. But this can't be such a difficult task in Python, surely: just reading and printing a file?

Edit:

Just tried using the code from this answer: https://stackoverflow.com/a/14786752/45311

import csv

with open('source.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

This at least prints some rows to the screen, but it also throws an error after some rows:

File "cp850.py", line 19, in encode
    return codecs.charmap_encode(input, self.errors, encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 62-63: character maps to <undefined>

It seems to automatically use CP850, which is yet another encoding... I can't make sense of all this...
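Note that this traceback points at print(), not at reading the file: on Windows, print() encodes output to the console's code page (cp850 here), which cannot represent Thai or Arabic characters. A minimal sketch of one workaround, pre-encoding with a lossy error handler before printing (the variable names are just illustrative):

```python
text = "●• อีเปียขี้บ่น ت •●"

# Encode to the console code page with errors="replace", so every character
# cp850 cannot represent becomes "?", then decode back to a printable string.
safe = text.encode("cp850", errors="replace").decode("cp850")
print(safe)  # prints with "?" in place of the unsupported characters
```

This loses the unsupported characters on screen, but the file itself is still read correctly as UTF-8.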

Answer 1:

Which version of Python are you using? If you're on 2.x, try pasting this import at the beginning of your script:

from __future__ import unicode_literals

then try:

print(open(path).read().encode('utf-8'))

There is also a great tool for charset detection: chardet. I hope it helps.
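chardet is a third-party package; a rough stdlib-only alternative is to try a few candidate encodings in order and keep the first one that decodes without error (the sniff_encoding helper and candidate list below are just an illustration, not chardet's API):

```python
def sniff_encoding(raw: bytes, candidates=("utf-8", "utf-16", "cp1252")):
    """Return the first candidate encoding that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

sample = "●• อีเปียขี้บ่น ت •●".encode("utf-8")
print(sniff_encoding(sample))  # valid UTF-8, so the first candidate wins
```

Order matters here: permissive single-byte codecs like cp1252 or latin-1 decode almost any byte sequence, so a "successful" decode late in the list is weak evidence. chardet does statistical analysis instead and is more reliable for truly unknown input.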



Answer 2:

You can use the errors parameter of the open() function. Try one of the options below (descriptions taken from the Python documentation):

  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.

So, you can use:

print(open(path, encoding="utf-8", errors="ignore").read())
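For the 0x8f byte from the original error, surrogateescape lets you read the file losslessly and later write the exact same bytes back out. A small round-trip sketch with made-up data:

```python
# A stray 0x8f byte that is invalid in UTF-8, as in the original error.
raw = "some text ".encode("utf-8") + b"\x8f" + " more".encode("utf-8")

# Decoding with surrogateescape maps the bad byte to the lone
# surrogate U+DC8F instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

The caveat is that such a string still contains lone surrogates, so printing it or encoding it without the surrogateescape handler will raise; it is meant for pass-through processing, not display.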