I'm trying to read a CSV text file (UTF-8 without BOM, according to Notepad++) with Python, but there seems to be an encoding problem:
print(open(path, encoding="utf-8").read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f
This little character seems to be the problem: ●
(full string: "●• อีเปียขี้บ่น ت •●"), however I'm sure there will be more.
If I try UTF-16 instead, I get this message:
# also tried with .encode:
print(open(path, encoding="utf-16").read().encode('utf-8'))
Illegal UTF-16 surrogate
Even when I try opening it with an automatic codec finder I receive the error.
import csv
import codecs

def csv_unireader(f, encoding="utf-8"):
    # decode rows from the source encoding, re-encode as UTF-8 for csv.reader
    for row in csv.reader(codecs.iterencode(codecs.iterdecode(f, encoding), "utf-8")):
        yield [e.decode("utf-8") for e in row]
What am I overlooking? The file contains Twitter texts, which certainly include a lot of different characters. But surely just reading and printing a file can't be that difficult a task in Python?
Edit:
Just tried using the code from this answer: https://stackoverflow.com/a/14786752/45311
import csv

with open('source.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
This at least prints some rows to the screen, but it also throws an error after some rows:
  File "cp850.py", line 19, in encode
    return codecs.charmap_encode(input, self.errors, encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 62-63: character maps to <undefined>
It seems to use cp850 automatically, which is yet another encoding... I can't make sense of all this.
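A minimal sketch of a console-safe print, assuming the failure is print() re-encoding the rows to the console's cp850 codepage (the reading itself succeeded), with a sample string taken from the question:

```python
import sys

text = "●• อีเปียขี้บ่น ت •●"  # sample row content from the question

# The UTF-8 decode already worked; the traceback comes from print()
# re-encoding to the console codepage (cp850 here). Replacing unmappable
# characters keeps the charmap codec from raising:
console = sys.stdout.encoding or "utf-8"
safe = text.encode(console, errors="replace").decode(console)
print(safe)
```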
Which version of Python are you using? If you are on 2.x, try pasting the import at the beginning of your script:
then try:
There is also a great tool for charset detection: chardet. I hope it helps you.
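As a sketch, chardet can guess the encoding from the raw bytes before you pick one for open(); the sample string comes from the question, and the import guard is only so the snippet runs even without chardet installed:

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None

# Raw bytes as they might appear in the file (sample string from the question):
raw = "●• อีเปียขี้บ่น ت •●".encode("utf-8")

if chardet is not None:
    guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    print(guess["encoding"], guess["confidence"])
else:
    print("chardet is not installed")
```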
You can use the errors parameter of the open function. You can try one of the options below (I extracted the descriptions from the Python documentation):
So, you can use:
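For example, a minimal sketch: the stray 0x8f byte is fabricated here just to provoke the error, and a throwaway temp file stands in for source.csv:

```python
import os
import tempfile

# Write a UTF-8 file with one deliberately invalid byte appended,
# standing in for the byte the codec chokes on:
raw = "●• อีเปียขี้บ่น ت •●".encode("utf-8") + b"\x8f"

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(raw)

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising:
with open(tmp.name, encoding="utf-8", errors="replace") as f:
    replaced = f.read()

# errors="ignore" silently drops them:
with open(tmp.name, encoding="utf-8", errors="ignore") as f:
    ignored = f.read()

os.unlink(tmp.name)
```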