Whenever I try to read UTF-8 encoded text files, even with open(file_name, encoding='utf-8'), I always get an error saying the ascii codec can't decode some characters (e.g. when using for line in f: print(line)).
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'
>>> import sys
>>> sys.getfilesystemencoding()
'ascii'
>>>
and the locale command prints:
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=en_HK.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I think you are misreading the error message. Be careful to distinguish between UnicodeDecodeError and UnicodeEncodeError.
You say that Python complains that “ascii codec can't decode some characters”. However, there is no such error message, as far as I know. Compare the following two cases:
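A minimal sketch of the two cases, using made-up sample data (the byte string and the character ä are illustrative and not taken from the question):

```python
# Case 1: decoding (bytes -> str) with a codec that cannot handle the bytes.
try:
    b'\xc3\xa4'.decode('ascii')
except UnicodeDecodeError as exc:
    decode_message = str(exc)

# Case 2: encoding (str -> bytes) with a codec that cannot handle the character.
try:
    '\xe4'.encode('ascii')
except UnicodeEncodeError as exc:
    encode_message = str(exc)

print(decode_message)  # 'ascii' codec can't decode byte 0xc3 in position 0: ...
print(encode_message)  # 'ascii' codec can't encode character '\xe4' in position 0: ...
```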
It's either “can't decode byte” or “can't encode character”, but it's never “can't decode character”.
This might seem pedantic, but in this line,
for line in f: print(line)
you have both decoding (before the colon) and encoding (the print expression). So you need to be sure which process is causing trouble. One possibility would be to write this in two lines, so that the traceback points at either the for statement or the print call.

However, if f is opened with encoding='utf-8', as you write, then I'm pretty sure the problem is caused by the print expression. print() writes to sys.stdout by default. Since this stream is already open when Python is started, its encoding is already set as well, depending on your environment. Since in your locale LC_ALL is not set, the ASCII default (“ANSI X3.4-1968”) is used (this might answer your question in the title).

If you can't or don't want to change the locale, here's what you can do to send UTF-8 text to STDOUT from within Python:
- use the underlying binary stream, sys.stdout.buffer, and write the encoded bytes to it yourself
- re-encode sys.stdout (actually: replace sys.stdout with a re-encoded version)

In any case, it's still possible that your terminal is unable to properly display UTF-8 text, either because it's incapable of that or because it's not configured to do so. In that case, you'll probably see question marks or mojibake. But that's a different story, outside of Python's control...
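The two workarounds listed above could be sketched roughly as follows. The helper names write_utf8 and utf8_text_stream are made up for illustration, and io.TextIOWrapper is just one way to build the re-encoded stream; treat this as a sketch, not the only way to do it.

```python
import io


def write_utf8(text, binary_stream):
    """Option 1: bypass the text layer and write UTF-8 bytes directly."""
    binary_stream.write(text.encode('utf-8'))
    binary_stream.flush()


def utf8_text_stream(binary_stream):
    """Option 2: wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', line_buffering=True)
```

In a script, the first option would be called as write_utf8(line, sys.stdout.buffer), and the second as sys.stdout = utf8_text_stream(sys.stdout.buffer), after which plain print(line) encodes as UTF-8 regardless of the locale.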
I had a similar problem. For me, the environment variable LANG was initially not set (you can check this by running env).
The available locales for me were (on a fresh Ubuntu 18.04 Docker image):
So I picked the UTF-8 one:
And then things work.
If you pick a locale that is not available, such as
it will not work:
and this is why locale is giving the "Cannot set LC_CTYPE to default locale" error messages shown in the question.
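Whether a given locale is actually installed can also be checked from within Python. The following is a minimal sketch; the helper name locale_available is hypothetical, and it only probes the LC_CTYPE category, which is the one that determines the preferred text encoding.

```python
import locale


def locale_available(name):
    """Return True if the C library can load the named locale for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        # Raised when the requested locale is not installed on this system.
        return False
    else:
        return True
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, saved)
```

For example, locale_available('C') should always be true, while a name that is set in the environment but not installed on the system would give False.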