While porting code from Python 2 to Python 3, I ran into a problem when reading UTF-8 text from standard input. In Python 2, this works fine:
for line in sys.stdin:
    ...
But Python 3 expects ASCII from sys.stdin, and if there are non-ASCII characters in the input, I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte .. in position ..: ordinal not in range(128)
For a regular file, I would specify the encoding when opening the file:
with open('filename', 'r', encoding='utf-8') as file:
    for line in file:
        ...
But how can I specify the encoding for standard input? Other SO posts have suggested using
input_stream = codecs.getreader('utf-8')(sys.stdin)
for line in input_stream:
    ...
However, this doesn't work in Python 3. I still get the same error message. I'm using Ubuntu 12.04.2 and my locale is set to en_US.UTF-8.
Python 3 does not expect ASCII from sys.stdin. It'll open stdin in text mode and make an educated guess as to what encoding is used. That guess may come down to ASCII, but that is not a given. See the sys.stdin documentation on how the codec is selected.
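To see what guess was actually made on a given machine, it can help to print the encodings involved. This is only a diagnostic sketch; the helper name report_encodings is made up for illustration, and the locale's preferred encoding is shown because it is typically one of the inputs to that guess, not because it is guaranteed to match.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings this process would use for text I/O."""
    return {
        # sys.stdin can be None in some environments, hence getattr
        "stdin": getattr(sys.stdin, "encoding", None),
        "locale_preferred": locale.getpreferredencoding(False),
    }


encodings = report_encodings()
for name, value in encodings.items():
    print(f"{name}: {value}")
```

If the stdin entry is not a UTF-8 codec, one of the approaches in this answer is needed.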
Like other file objects opened in text mode, the sys.stdin object derives from the io.TextIOBase base class; it has a .buffer attribute pointing to the underlying buffered IO instance (which in turn has a .raw attribute).
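The layering can be inspected on any file opened in text mode. The sketch below uses a temporary file rather than stdin so it runs anywhere; layer_names is a hypothetical helper, not part of the io module.

```python
import os
import tempfile


def layer_names(text_stream):
    """Return the class names of the text, buffered, and raw layers."""
    buffered = text_stream.buffer   # binary, buffered layer
    raw = buffered.raw              # unbuffered layer underneath
    return (
        type(text_stream).__name__,
        type(buffered).__name__,
        type(raw).__name__,
    )


# Demonstrate on an ordinary file opened in text mode.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "r", encoding="utf-8") as handle:
        layers = layer_names(handle)
finally:
    os.remove(path)

print(layers)  # ('TextIOWrapper', 'BufferedReader', 'FileIO')
```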
Wrap the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding:
import io
import sys
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
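As a slightly fuller sketch of the same idea, the wrapping can be put in a small function that accepts any binary stream. The name read_utf8_lines is an assumption for illustration; an in-memory io.BytesIO stands in for sys.stdin.buffer so the example can be run without piping input.

```python
import io
import sys


def read_utf8_lines(binary_stream=None):
    """Decode a binary stream as UTF-8 and return its lines without newlines."""
    if binary_stream is None:
        # Real use: decode the bytes behind standard input.
        binary_stream = sys.stdin.buffer
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    return [line.rstrip("\n") for line in text_stream]


# Stand-in for piped input containing non-ASCII characters.
fake_stdin = io.BytesIO("héllo\nwörld\n".encode("utf-8"))
lines = read_utf8_lines(fake_stdin)
print(lines)  # ['héllo', 'wörld']
```

Calling read_utf8_lines() with no argument reads from the real standard input instead.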
Alternatively, set the PYTHONIOENCODING environment variable to the desired codec when running python.
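One way to check the effect of the variable is to start a child interpreter with it set and ask that child what encoding its stdin uses. This is a rough sketch; child_stdin_encoding is a hypothetical helper, and in everyday use the variable would more commonly be set in the shell before starting python.

```python
import os
import subprocess
import sys


def child_stdin_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return its stdin encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdin.encoding)"],
        env=env,
        input="",             # give the child an empty, piped stdin
        capture_output=True,  # requires Python 3.7 or newer
        text=True,
        check=True,
    )
    return result.stdout.strip()


child_encoding = child_stdin_encoding("utf-8")
print(child_encoding)
```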
From Python 3.7 onwards, you can also reconfigure the existing std* wrappers, provided you do it at the start (before any data has been read):
# Python 3.7 and newer
sys.stdin.reconfigure(encoding='utf-8')
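The same reconfigure() method exists on any io.TextIOWrapper, so its behaviour can be tried without touching the real stdin. In this sketch an in-memory stream deliberately opened as ASCII stands in for a misdetected sys.stdin; the variable names are illustrative only.

```python
import io

# A text stream whose encoding was "guessed" wrong (ASCII), over UTF-8 bytes.
payload = "naïve café\n".encode("utf-8")
stream = io.TextIOWrapper(io.BytesIO(payload), encoding="ascii")

# Switch the codec before reading anything (Python 3.7 and newer).
stream.reconfigure(encoding="utf-8")

decoded = stream.read()
print(decoded, end="")
```

Without the reconfigure() call, the read() would raise a UnicodeDecodeError like the one in the question.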