I'm trying to find a generic solution to print unicode strings from a python script.
The requirements are that it must run in both python 2.7 and 3.x, on any platform, and with any terminal settings and environment variables (e.g. LANG=C or LANG=en_US.UTF-8).
The python print function automatically tries to encode to the terminal encoding when printing, but if the terminal encoding is ascii it fails.
For example, the following works when the environment has "LANG=en_US.UTF-8":
x = u'\xea'
print(x)
But it fails in python 2.7 when "LANG=C":
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 0: ordinal not in range(128)
The following works regardless of the LANG setting, but would not properly show unicode characters if the terminal was using a different unicode encoding:
print(x.encode('utf-8'))
The desired behavior would be to always show unicode in the terminal if it is possible and show some encoding if the terminal does not support unicode. For example, the output would be UTF-8 encoded if the terminal only supported ascii. Basically, the goal is to do the same thing as the python print function when it works, but in the cases where the print function fails, use some default encoding.
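A minimal sketch of that fallback, assuming the only case that needs overriding is a stream whose encoding is missing or resolves to ASCII. The helper name `utf8_if_ascii` is hypothetical; it relies only on `codecs.lookup` and `codecs.getwriter` from the standard library.

```python
import codecs


def utf8_if_ascii(stream):
    """Return `stream` unchanged unless its encoding is missing or ASCII.

    Hypothetical helper for illustration, not part of any library.
    """
    encoding = getattr(stream, 'encoding', None)
    # codecs.lookup() resolves aliases such as 'ANSI_X3.4-1968' (what
    # LANG=C usually reports) to the canonical codec name 'ascii'.
    if encoding and codecs.lookup(encoding).name != 'ascii':
        return stream  # e.g. UTF-8 or ISO 8859-1: respect the setting
    # Python 3 text streams expose their byte stream as .buffer;
    # Python 2 file objects accept bytes directly.
    raw = getattr(stream, 'buffer', stream)
    return codecs.getwriter('utf-8')(raw, errors='replace')


# Assumed usage, once at program start:
#     import sys
#     sys.stdout = utf8_if_ascii(sys.stdout)
#     print(u'\xea')
```

Streams that already report a non-ASCII encoding are returned untouched, so only the ASCII default is replaced.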
You can handle the `LANG=C` case by telling `sys.stdout` to default to UTF-8 in cases when it would otherwise default to ASCII.

The above snippet fulfills your requirements: it works in Python 2.7 and 3.4, and it doesn't break when `LANG` is in a non-UTF-8 setting such as `C`.

It is not a new technique, but it's surprisingly hard to find in the documentation. As presented above, it actually respects non-UTF-8 settings such as `ISO 8859-*`. It only defaults to UTF-8 if Python would have bogusly defaulted to ASCII, breaking the application.

I don't think you should try and solve this at the Python level. Document your application requirements, log the locale of systems you run on so it can be included in bug reports and leave it at that.
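As a rough sketch of the "log the locale" suggestion, the settings that usually matter could be collected like this. The function names are hypothetical, and the choice of variables (`LANG`, `LC_ALL`, `LC_CTYPE`) is an assumption about what is useful in a bug report.

```python
import locale
import logging
import os
import sys


def describe_text_environment():
    """Gather the settings that influence how text is encoded on output."""
    return {
        'LANG': os.environ.get('LANG'),
        'LC_ALL': os.environ.get('LC_ALL'),
        'LC_CTYPE': os.environ.get('LC_CTYPE'),
        'preferred_encoding': locale.getpreferredencoding(),
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
    }


def log_text_environment(logger=None):
    """Write the collected settings to the log at INFO level."""
    if logger is None:
        logger = logging.getLogger(__name__)
    logger.info('text environment: %r', describe_text_environment())
```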
If you do want to go this route, at least distinguish between terminals and pipes. You should never output data to a terminal that the terminal cannot explicitly handle. Don't blindly output UTF-8, for example, as the non-printable codepoints > U+007F could end up being interpreted as control codes when encoded.
For a pipe, output UTF-8 by default and make it configurable.
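One way to sketch the terminal/pipe distinction, assuming `backslashreplace` as the error handler for terminals and a `pipe_encoding` parameter as the configuration hook for pipes. The name `writer_for` is hypothetical.

```python
import codecs


def writer_for(stream, pipe_encoding='utf-8'):
    """Wrap `stream` in a writer chosen by whether it is a terminal.

    Hypothetical helper for illustration, not part of any library.
    """
    # Python 3 text streams expose their byte stream as .buffer;
    # Python 2 file objects accept bytes directly.
    raw = getattr(stream, 'buffer', stream)
    if hasattr(stream, 'isatty') and stream.isatty():
        # Terminal: keep its own encoding and escape whatever it cannot show.
        encoding = getattr(stream, 'encoding', None) or 'ascii'
        errors = 'backslashreplace'
    else:
        # Pipe or file: use the configurable codec, UTF-8 by default.
        encoding = pipe_encoding
        errors = 'strict'
    return codecs.getwriter(encoding)(raw, errors=errors)
```

With this, `writer_for(sys.stdout).write(u'\xea\n')` would emit an escape sequence on an ASCII terminal and UTF-8 bytes when the output is piped.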
So you'd detect if a TTY is being used, then handle encoding based on that; for a terminal, set an error handler (pick one of `replace` or `backslashreplace` to provide replacement characters or escape sequences for whatever characters cannot be handled). For a pipe, use a configurable codec.

You can handle the exception:
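A minimal sketch of one way to do that, assuming UTF-8 as the fallback encoding. The helper name `safe_write` is hypothetical, and it writes directly to the stream rather than calling `print` so the same code runs on Python 2.7 and 3.x.

```python
import sys


def safe_write(text, stream=None):
    """Write `text` plus a newline, falling back to UTF-8 bytes on failure.

    Hypothetical helper for illustration, not part of any library.
    """
    stream = sys.stdout if stream is None else stream
    line = text + u'\n'
    try:
        stream.write(line)
    except UnicodeEncodeError:
        # Flush pending text first so the raw bytes stay in order.
        stream.flush()
        # Python 3 text streams expose their byte stream as .buffer;
        # Python 2 file objects accept bytes directly.
        raw = getattr(stream, 'buffer', stream)
        raw.write(line.encode('utf-8'))
```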
You can encode the string yourself with the special parameter `'backslashreplace'` so that unrepresentable characters are converted to escape sequences. In Python 2 you can print the result of `encode` directly, but in Python 3 you need to `decode` it back to Unicode first.

If `sys.stdout.encoding` doesn't deliver the value that your terminal can handle, that's a separate problem that you must deal with.
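The `backslashreplace` round trip described above can be sketched as follows, assuming `sys.stdout.encoding` is trustworthy and falling back to ASCII when it is unset. The helper name `escape_unencodable` is hypothetical.

```python
import sys


def escape_unencodable(text, stream=None):
    """Return `text` in a form that is safe to print on `stream`.

    Hypothetical helper for illustration, not part of any library.
    """
    stream = sys.stdout if stream is None else stream
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    # Characters the codec cannot represent become escapes such as \xea.
    data = text.encode(encoding, 'backslashreplace')
    if sys.version_info[0] < 3:
        return data  # Python 2: the byte string can be printed directly
    return data.decode(encoding)  # Python 3: back to str before printing
```

Calling `print(escape_unencodable(x))` then prints the original characters where the stream's encoding supports them and escape sequences elsewhere.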