Read a file and try to remove all non-UTF-8 chars

Posted 2019-09-05 23:09

Question:

I am trying to read a file and convert its contents to a UTF-8 string, in order to remove some non-UTF-8 characters from the file contents:

file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')

but I got the following error:

AttributeError: 'str' object has no attribute 'decode'

Update: I tried the code suggested by the answer:

file_str = open(file_path, 'r', encoding='utf-8').read()

but it didn't eliminate the non-UTF-8 characters. How can I remove them?

Answer 1:

Remove the .decode('utf-8') call. Your file data has already been decoded: in Python 3, calling open() in text mode (the default) returns a file object that decodes the data to Unicode strings for you, and str objects have no .decode() method.
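
To make the difference concrete, here is a minimal sketch contrasting text mode with binary mode. The temporary file, its contents, and the name sample_path are assumptions made up for illustration; they are not from the question.

```python
import os
import tempfile

# Hypothetical sample data: a small UTF-8 encoded file in a temporary location.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("héllo".encode("utf-8"))
    sample_path = tmp.name

# Text mode: open() decodes the bytes for you, so read() returns str.
with open(sample_path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns raw bytes, which is the type that has .decode().
with open(sample_path, "rb") as f:
    raw = f.read()

os.remove(sample_path)

print(type(text).__name__, hasattr(text, "decode"))     # str False
print(type(raw).__name__, raw.decode("utf-8") == text)  # bytes True
```

Only the bytes object needs decoding, which is why calling .decode() on the result of a text-mode read raises AttributeError.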

You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:

file_str = open(file_path, 'r', encoding='utf8').read()

For example, on Windows the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you look at the text you read: you'd find you have Mojibake, because the UTF-8 data was decoded using CP1252 or a similar 8-bit codec.
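
A short sketch of how that kind of Mojibake arises. The sample string is an assumption chosen for illustration; the bytes are decoded in memory rather than read from a file, so the result does not depend on the system default.

```python
# UTF-8 encodes "é" as the two bytes 0xC3 0xA9.
utf8_bytes = "café".encode("utf-8")

# Decoding those bytes with the wrong 8-bit codec raises no error;
# each byte is simply mapped to its own character.
garbled = utf8_bytes.decode("cp1252")

# Decoding with the codec the data was written in gives the original text.
correct = utf8_bytes.decode("utf-8")

print(garbled)  # cafÃ©
print(correct)  # café
```

Because the wrong codec decodes without complaint, the problem only shows up when someone looks at the text.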

See the open() function documentation for further details.
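
One parameter described there is errors, which may be what the update in the question is after. The following is a minimal sketch, assuming "non UTF-8 chars" means bytes that cannot be decoded as UTF-8. The sample bytes and the temporary file are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample: ASCII text with one stray byte (0xFF), which can
# never appear in valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc\xffdef")
    file_path = tmp.name

# errors='ignore' silently drops any bytes that cannot be decoded.
with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    cleaned = f.read()

# errors='replace' substitutes U+FFFD instead, so the damage stays visible.
with open(file_path, "r", encoding="utf-8", errors="replace") as f:
    marked = f.read()

os.remove(file_path)

print(cleaned)  # abcdef
```

Without an errors argument, text mode normally uses strict handling and raises UnicodeDecodeError on such bytes. Dropping bytes loses data, so errors='replace' may be the safer choice when you need to see where the file was damaged.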