I am trying to read a file and decode its contents as UTF-8, in order to remove some non-UTF-8 characters from the file string:
file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')
but I got the following error:
AttributeError: 'str' object has no attribute 'decode'
Update: I tried the code suggested by the answer:
file_str = open(file_path, 'r', encoding='utf-8').read()
but it didn't eliminate the non-UTF-8 characters. How can I remove them?
Remove the .decode('utf-8') call. Your file data has already been decoded: in Python 3, calling open() in text mode (the default) returns a file object that decodes the data to Unicode strings for you, and a str object has no decode() method.
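To see the difference for yourself, here is a minimal sketch that reads the same file once in text mode and once in binary mode. The temporary file and the names demo_path, text and raw are made up for the demonstration and are not part of the original code.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location (hypothetical demo data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9".encode("utf-8"))
    demo_path = tmp.name

# Text mode: the file object decodes the bytes and hands back a str.
with open(demo_path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: you get raw bytes, which is the type that has .decode().
with open(demo_path, "rb") as f:
    raw = f.read()

os.remove(demo_path)

print(type(text).__name__, hasattr(text, "decode"))     # str False
print(type(raw).__name__, raw.decode("utf-8") == text)  # bytes True
```

So .decode() only makes sense if you open the file with 'rb'; in text mode the decoding has already happened.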
You probably do want to pass the encoding to the open() call to make this explicit. Otherwise Python uses a platform-dependent default, and that may not be UTF-8:
file_str = open(file_path, 'r', encoding='utf8').read()
For example, on Windows the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd then find you have Mojibake, because the UTF-8 data was decoded using CP1252 or a similar 8-bit codec.
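A small illustration of that effect, done in memory rather than with a real file. The sample string is an arbitrary choice, and CP1252 is only assumed here as the "wrong" codec:

```python
original = "caf\u00e9"                   # "café": 4 characters
utf8_bytes = original.encode("utf-8")    # the é becomes two bytes: 0xC3 0xA9

# Decoding those bytes with the wrong 8-bit codec raises no error here;
# each byte is simply mapped to its own character.
garbled = utf8_bytes.decode("cp1252")
correct = utf8_bytes.decode("utf-8")

print(len(original), len(garbled))       # 4 5
print(garbled == "caf\u00c3\u00a9")      # True: the é turned into "Ã©"
print(correct == original)               # True
```

Nothing fails at decode time in this case, which is why the problem tends to go unnoticed until someone looks at the text.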
See the open() function documentation for further details.
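As for the follow-up about actually getting rid of bytes that are not valid UTF-8: one option is the errors parameter of open(). This is only a sketch under assumptions. The demo file, its stray 0xff byte and the variable names are invented for illustration, and whether silently dropping data is acceptable depends on your use case.

```python
import os
import tempfile

# Hypothetical demo file: ASCII text with one stray byte (0xff),
# which can never occur in valid UTF-8.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"abc\xffdef")
    demo_path = tmp.name

# errors="ignore" silently drops any bytes that cannot be decoded.
with open(demo_path, "r", encoding="utf-8", errors="ignore") as f:
    dropped = f.read()

# errors="replace" keeps a visible marker (U+FFFD) where decoding failed.
with open(demo_path, "r", encoding="utf-8", errors="replace") as f:
    marked = f.read()

os.remove(demo_path)

print(dropped)        # abcdef
print(ascii(marked))  # 'abc\ufffddef'
```

With the default errors='strict', the same read would raise UnicodeDecodeError instead. If the file decodes without errors but still looks wrong, the data is probably in a different encoding altogether, and dropping bytes will not fix that.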