UnicodeDecodeError in Python when reading a file,

I have to read a text file into Python. The file encoding is:

file -bi test.csv 
text/plain; charset=us-ascii

This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non ascii characters, such as Ö, for example. I need to read the lines using python, and I can afford to ignore a line which has a non-ascii character.

My problem is that when I read the file in Python, I get the UnicodeDecodeError when reaching the line where a non-ascii character exists, and I cannot read the rest of the file.

Is there a way to avoid this. If I try this:

fileHandle = codecs.open("test.csv", encoding='utf-8');
try:
    for line in companiesFile:
        print(line, end="");
except UnicodeDecodeError:
    pass;

then when the error is reached the for loop ends and I cannot read the remaining of the file. I want to skip the line that causes the mistake and go on. I would rather not do any changes to the input file, if possible.

Is there any way to do this? Thank you very much.

标签： python python-3.x utf-8

1条回答

Rolldiameter

2楼-- · 2019-04-19 05:58

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.

You can tell open() how to treat decoding errors, with the errors keyword:

errors is an optional string that specifies how encoding and decoding errors are to be handled–this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.

'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.

'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.

'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.

'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.

'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.

Opening the file with 'ignore' or 'replace' will then let you read the file without exceptions being raised.

0人赞添加讨论(0) 举报

UnicodeDecodeError in Python when reading a file,

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间