I run this snippet twice in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni
Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what's going on behind the curtain in both cases?
The key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) a string of characters, and (2) a string/array of bytes. This distinction was mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman, …): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes). The relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable: most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system, and would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:
Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and
Encode it while printing
This is because, when you run the script directly, Python encodes the string before writing it to the terminal. When you pipe or redirect the output, Python cannot detect the target encoding, so you have to encode the string yourself when doing I/O.
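As a minimal sketch of "encode it yourself", the snippet below encodes the string from the question explicitly and writes the resulting bytes to a binary stream. The helper name `write_encoded` and the use of `io.BytesIO` as a stand-in for a redirected stdout are assumptions made for illustration, not part of the original answer.

```python
import io

uni = u"\u001A\u0BC3\u1451\U0001D10C"


def write_encoded(text, binary_stream, encoding="utf-8"):
    """Encode text explicitly and write the resulting bytes to a binary stream."""
    data = text.encode(encoding)
    binary_stream.write(data)
    return data


# BytesIO stands in for a redirected stdout or a file opened in binary mode.
buf = io.BytesIO()
data = write_encoded(uni + u"\n", buf)
```

In Python 2, which the question uses, the same idea is usually written as `print uni.encode("utf-8")`. Because the bytes are produced explicitly, the result no longer depends on what Python can find out about the output stream.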
Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal, Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe, Python defaults to the 'ascii' encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.
In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).
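The effect of PYTHONIOENCODING on piped output can be checked with a small sketch like the one below, which starts a child interpreter with its stdout connected to a pipe. The helper `run_piped` and the one-line child program are hypothetical and only serve the illustration; the character π is used because it exists in both UTF-8 and cp437.

```python
import os
import subprocess
import sys

# Child program: write a single non-ASCII character (U+03C0, pi) to stdout.
child_code = "import sys; sys.stdout.write(u'\\u03c0\\n')"


def run_piped(io_encoding):
    """Run the child with stdout piped and PYTHONIOENCODING set; return raw bytes."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.check_output([sys.executable, "-c", child_code], env=env)


utf8_bytes = run_piped("utf-8")   # pi is the two bytes CF 80 in UTF-8
cp437_bytes = run_piped("cp437")  # pi is the single byte E3 in cp437
```

From a shell, the equivalent is to set the variable on the command line, for example `PYTHONIOENCODING=utf-8 ./test.py >out.txt`.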
Example 1
Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in source that my terminal could not. The encoding is printed to stderr so it can still be seen when stdout is redirected to a file.
Output (run directly from terminal)
Python correctly determined the encoding of the terminal.
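To see what Python has determined about its output stream, one can inspect `sys.stdout` directly. The following is only a sketch and not the script from this answer; the helper name `describe_stdout` is an assumption. It reports to stderr so the message stays visible even when stdout is redirected.

```python
import sys


def describe_stdout():
    """Return (is_terminal, encoding) for the current sys.stdout."""
    is_terminal = bool(sys.stdout.isatty())
    encoding = getattr(sys.stdout, "encoding", None)
    return is_terminal, encoding


is_terminal, encoding = describe_stdout()
# Report on stderr so the message is still shown when stdout goes to a file.
sys.stderr.write("stdout is a terminal: %s, encoding: %r\n" % (is_terminal, encoding))
```

On Python 2 the reported encoding is None when stdout is redirected, which is the case discussed here. Python 3 instead reports an encoding derived from the locale, so the same script behaves differently there.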
Output (redirected to file)
Python could not determine the encoding (None), so it used the 'ascii' default. ASCII can only encode the first 128 characters of Unicode.
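The ASCII limitation is easy to reproduce without any redirection by encoding by hand. This is a small sketch; the helper `try_ascii` and the sample strings are chosen for illustration only.

```python
plain = u"abc"                 # every code point is below 128
greek = u"\u03b1\u03b2\u03c0"  # alpha, beta, pi: all outside ASCII


def try_ascii(text):
    """Return the ASCII-encoded bytes, or None if some character cannot be encoded."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


plain_result = try_ascii(plain)
greek_result = try_ascii(greek)
```

Here `plain_result` holds the encoded bytes while `greek_result` is None, because the 'ascii' codec raises UnicodeEncodeError on the first character outside the 0 to 127 range.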
Output (redirected to file, PYTHONIOENCODING=cp437)
and my output file was correct:
Example 2
Now I'll throw in a character in the source that isn't supported by my terminal:
Output (run directly from terminal)
My terminal didn't understand that last Chinese character.
Output (run directly, PYTHONIOENCODING=437:replace)
Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. Other options include ignore and xmlcharrefreplace. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
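The same error handlers can be tried directly with `encode`, which makes their differences easy to compare. In this sketch the sample text (π, which cp437 supports, followed by the Chinese character U+9A6C, which it does not) is my own choice for illustration.

```python
# pi (U+03C0) exists in cp437; the Chinese character U+9A6C does not.
text = u"\u03c0\u9a6c"

replaced = text.encode("cp437", "replace")            # unknown character becomes ?
ignored = text.encode("cp437", "ignore")              # unknown character is dropped
xmlrefs = text.encode("cp437", "xmlcharrefreplace")   # becomes a numeric XML reference

# UTF-8 can encode every character here, so the handler never has to act.
utf8_same = text.encode("utf-8", "replace") == text.encode("utf-8")
```

The handler names are the same ones accepted after the colon in PYTHONIOENCODING, as in the 437:replace setting used for the output above.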