From the Python 2.6 shell:
>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>>
I expected to have either some gibberish or an Error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.
EDIT
I moved the edit to the Answers section and accepted it as suggested.
When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system (Python 2):
sys.getdefaultencoding() is only used when Python doesn't have another option.
Note that Python 3.6 or later ignores encodings on Windows and uses Unicode APIs to write Unicode to the terminal. No UnicodeEncodeError warnings and the correct character is displayed if the font supports it. Even if the font doesn't support it the characters can still be cut-n-pasted from the terminal to an application with a supporting font and it will be correct. Upgrade!
You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.
In the case of \xe9, Python assumes your default encoding (ASCII), thus printing ... something blank.
As per Python default/implicit string encodings and conversions:
- When printing unicode, it's encoded with <file>.encoding.
- When the encoding is not set, the unicode is implicitly converted to str (since the codec for that is sys.getdefaultencoding(), i.e. ascii, any national characters would cause a UnicodeEncodeError).
- For standard streams, the encoding is inferred from the environment. It's typically set for tty streams (from the terminal's locale settings), but is likely not to be set for pipes.
- So print u'\xe9' is likely to succeed when the output is to a terminal, and fail if it's redirected. A solution is to encode() the string with the desired encoding before printing.
- When printing str, the bytes are sent to the stream as is. What glyphs the terminal shows will depend on its locale settings.
The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It's when it can't figure out what's going on that it bugs out.
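The printing rules above can be tried without touching the real terminal. The following is a minimal sketch, not taken from any of the answers: it assumes an in-memory io.TextIOWrapper is a fair stand-in for a stream whose encoding is set, and the helper name bytes_written is made up for illustration. It is written for Python 3, although the io module also exists in Python 2.6 and later.

```python
import io

def bytes_written(text, encoding):
    """Return the bytes a text stream with the given encoding would emit."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # keep the wrapper from closing `raw` on cleanup
    return data

# A stream whose encoding can represent the character: encoding succeeds.
utf8_bytes = bytes_written(u'\xe9', 'utf-8')

# A stream limited to ASCII: the implicit encoding step fails.
try:
    bytes_written(u'\xe9', 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The suggested workaround: encode explicitly, then send the bytes as is.
explicit_bytes = u'\xe9'.encode('utf-8')

print(repr(utf8_bytes), ascii_failed, repr(explicit_bytes))
```

Encoding explicitly makes the result independent of whatever the stream's encoding happens to be, which is why it behaves the same for a terminal and for a pipe.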
Thanks to bits and pieces from various replies, I think we can stitch up an explanation.
When you print a Unicode string such as u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python picks up this setting from the environment it was started from. Only if it can't find a proper encoding in the environment does it revert to its default, ASCII.
For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:
Let's for a moment exit the Python shell and set bash's environment with some bogus encoding:
Then start the python shell again and verify that it does indeed revert to its default ascii encoding.
Bingo!
If you now try to output a Unicode character outside of ASCII, you should get a nice error message.
Let's exit Python and discard the bash shell.
We'll now observe what happens after Python outputs strings. For this we'll first start a bash shell within a graphic terminal (I use Gnome Terminal) and we'll set the terminal to decode output with ISO-8859-1 aka latin-1 (graphic terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn't change the actual shell environment's encoding, it only changes the way the terminal itself will decode the output it's given, a bit like a web browser does. You can therefore change the terminal's encoding independently from the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8 for me):
(1) Python outputs the binary string as is. The terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.
(2) Python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance "UTF-8". After UTF-8 encoding, the resulting binary string is '\xc3\xa9' (see later explanation). The terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long, so the latin-1 decoder interprets it as 0xc3 (195) and 0xa9 (169), and that yields 2 characters: Ã and ©.
(3) Python encodes the Unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out that latin-1 code points range from 0 to 255 and point to exactly the same characters as Unicode within that range. Therefore, Unicode code points in that range yield the same value when encoded in latin-1, so u'\xe9' (233) encoded in latin-1 also yields the binary string '\xe9'. The terminal receives that value and tries to match it against the latin-1 character map. Just like case (1), it yields "é" and that's what's displayed.
Let's now change the terminal's encoding settings to UTF-8 from the dropdown menu (like you would change your web browser's encoding settings). No need to stop Python or restart the shell. The terminal's encoding now matches Python's. Let's try printing again:
(4) Python outputs a binary string as is. The terminal attempts to decode that stream with UTF-8. But UTF-8 doesn't understand the value 0xe9 on its own (see later explanation) and is therefore unable to convert it to a Unicode code point. No code point found, no character printed.
(5) Python attempts to implicitly encode the Unicode string with whatever is in sys.stdout.encoding, still "UTF-8". The resulting binary string is '\xc3\xa9'. The terminal receives the stream and attempts to decode 0xc3a9, also using UTF-8. It yields back the code value 0xe9 (233), which on the Unicode character map points to the symbol "é". The terminal displays "é".
(6) Python encodes the Unicode string with latin-1, which yields a binary string with the same value, '\xe9'. Again, for the terminal this is pretty much the same as case (4).
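The six cases can be reproduced without changing any terminal settings. This is only a sketch: the three byte strings are inferred from the descriptions of cases (1) to (6) rather than copied from the original transcript, the helper terminal_shows is a made-up name, and decoding with errors='replace' merely stands in for what a real terminal does with bytes it cannot decode (it may show nothing or a placeholder glyph). It is written for Python 3.

```python
def terminal_shows(data, terminal_encoding):
    """Decode bytes the way a terminal set to terminal_encoding would."""
    # 'replace' substitutes U+FFFD for bytes the decoder cannot interpret.
    return data.decode(terminal_encoding, 'replace')

raw_bytes = b'\xe9'                       # a byte string printed as is
utf8_bytes = u'\xe9'.encode('utf-8')      # Unicode string, stdout is UTF-8
latin1_bytes = u'\xe9'.encode('latin-1')  # Unicode string encoded by hand

cases = [
    ('(1)', raw_bytes, 'latin-1'),
    ('(2)', utf8_bytes, 'latin-1'),
    ('(3)', latin1_bytes, 'latin-1'),
    ('(4)', raw_bytes, 'utf-8'),
    ('(5)', utf8_bytes, 'utf-8'),
    ('(6)', latin1_bytes, 'utf-8'),
]

shown = {label: terminal_shows(data, enc) for label, data, enc in cases}

for label, data, enc in cases:
    # Print escaped code points so the output is safe on any terminal.
    escaped = shown[label].encode('unicode_escape').decode('ascii')
    print(label, enc, escaped)
```

Cases (1), (3) and (5) come out as the single code point U+00E9, case (2) as the two characters U+00C3 U+00A9, and cases (4) and (6) as the replacement character U+FFFD.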
Conclusions:
- Python outputs non-Unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent from the shell's.
More details on unicode, UTF-8 and latin-1:
Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. For example, by convention it's been decided that key 0xe9 (233) is the value pointing to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to 'A' in ASCII, latin-1 and Unicode, 0xdc points to 'Ü' in latin-1 and Unicode, and 0xe9 points to 'é' in latin-1 and Unicode.
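The overlap between the three tables is easy to check. The snippet below is a small sketch for Python 3 (Python 2 would use unichr() instead of chr()); it relies only on the standard unicodedata module and the built-in codecs.

```python
import unicodedata

# Look up the official Unicode name behind two of the code points above.
for code_point in (0x41, 0xe9):
    char = chr(code_point)
    print(hex(code_point), unicodedata.name(char))

# Where the tables overlap, the same byte decodes to the same character.
same_in_ascii_range = (
    b'\x41'.decode('ascii') == b'\x41'.decode('latin-1') == u'\x41'
)
same_in_latin1_range = b'\xe9'.decode('latin-1') == u'\xe9'

print(same_in_ascii_range, same_in_latin1_range)
```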
When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That's what encoding schemes are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward encoding approach would be to simply use a code point's value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.
Most encoding schemes have shortcomings regarding space requirements. The most economical ones don't cover all Unicode code points: for example, ASCII only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up being wasteful, since they require more bytes than necessary, even for common "cheap" characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ASCII range ('B', which is 66, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.
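The storage cost of each scheme can be measured directly. In this sketch (Python 3) the '-le' codec variants are used on purpose, an illustrative choice, because the plain 'utf-16' and 'utf-32' codecs prepend a byte order mark that would inflate the counts.

```python
codecs_to_try = ('utf-8', 'utf-16-le', 'utf-32-le')

# Map (character, codec) to the number of bytes the encoded form takes.
sizes = {}
for char in (u'B', u'\xe9'):
    for codec in codecs_to_try:
        sizes[(char, codec)] = len(char.encode(codec))
        print('U+%04X' % ord(char), codec, sizes[(char, codec)])
```

'B' takes 1, 2 and 4 bytes respectively, while 'é' takes 2, 2 and 4.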
UTF-8 cleverly resolves the dilemma with a scheme able to store code points in a variable number of bytes. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.
UTF-8 encoding of unicode code points in the ascii range (0-127):
e.g. Unicode code point for 'B' is '0x42' or 0100 0010 in binary (as we said, it's the same in ASCII). After encoding in UTF-8 it becomes:
UTF-8 encoding of Unicode code points above 127 (non-ascii):
e.g. 'é' Unicode code point is 0xe9 (233).
When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:
The Unicode code point 0xe9 becomes 0xc3a9 after UTF-8 encoding, which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-Unicode legacy encodings), you'll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
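The two-byte case can be worked through by hand. The function below is an illustrative sketch (the name utf8_two_bytes is made up, and it deliberately handles only code points from 128 to 2047); it follows the standard UTF-8 bit layout, 110xxxxx for the lead byte and 10xxxxxx for the continuation byte.

```python
def utf8_two_bytes(code_point):
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7ff:
        raise ValueError('code point needs a different UTF-8 length')
    # Lead byte: flag bits 110, then the top 5 of the 11 payload bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: flag bits 10, then the low 6 payload bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes(bytearray([lead, continuation]))

encoded = utf8_two_bytes(0xe9)
print(repr(encoded))

# The hand-rolled result agrees with the built-in codec.
matches_builtin = encoded == u'\xe9'.encode('utf-8')
```

For 0xe9 (binary 1110 1001) the top bits 00011 give the lead byte 0xc3 and the low bits 101001 give the continuation byte 0xa9.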
It works for me: