When, where and how does Python implicitly apply encodings to strings or perform implicit transcoding (conversion)?
And what are those "default" (i.e. implied) encodings?
For example, what are the encodings:
of string literals?
s = "Byte string with national characters"
us = u"Unicode string with national characters"
of byte strings when type-converted to and from Unicode?
data = unicode(random_byte_string)
when byte and Unicode strings are written to or read from a file or a terminal?
print(open("The full text of War and Peace.txt").read())
Implicit encoding as the internal format for storing strings/arrays: you should not care about it. Python stores characters in its own internal way, and that is mostly transparent. Just imagine the value as Unicode text, or as a sequence of bytes, in the abstract.
The internal representation in Python 3.x (CPython 3.3 and later) varies according to the "largest" character in the string: 1 byte per character (Latin-1, which covers ASCII-only strings), 2 bytes (UCS-2) or 4 bytes (UCS-4/UTF-32). When you use a string, it behaves like an abstract Unicode string, not like data in a particular encoding. Unless you program in C or use some special functions (such as a memory view), you will never see the internal encoding.
Bytes are just a view of actual memory; Python interprets them as unsigned char. But again, you should usually think about what the sequence represents, not about its internal encoding.
Python 2 has bytes and str as unsigned char, and unicode as UCS-2 on narrow builds, so a code point above 65535 is stored as two code units (a surrogate pair) in Python 2, and as a single character in Python 3.
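To make the "abstract sequence" point concrete, here is a small Python 3 sketch (the sample string is just an illustration, not from the question). It shows that the length of a str counts code points, while a size in bytes only exists once you pick an encoding.

```python
# Python 3 sketch: a str is a sequence of code points, bytes a sequence of ints.
text = "na\u00efve \U0001F600"   # contains U+00EF and U+1F600 (outside the BMP)

# One item per code point, whatever CPython uses internally.
code_points = len(text)

# The size in bytes only becomes defined once an encoding is chosen.
utf8_len = len(text.encode("utf-8"))
utf16_len = len(text.encode("utf-16-le"))
utf32_len = len(text.encode("utf-32-le"))

# Indexing bytes yields integers in range(256), i.e. unsigned char values.
first_byte = text.encode("utf-8")[0]

print(code_points, utf8_len, utf16_len, utf32_len, first_byte)
```

The same text gives three different byte counts, which is why thinking of a str as "stored in UTF-something" tends to mislead.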
There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.
Short answer:
- Literals:
  - str (Py2) -- not applicable, raw bytes from the file are taken
  - unicode (Py2)/str (Py3) -- "source encoding", defaults are ascii (Py2) and utf-8 (Py3)
  - bytes (Py3) -- none, non-ascii characters are prohibited in the literal
- Transcoding:
  - Py2 -- sys.getdefaultencoding() (ascii almost always); implicit conversions often result in a UnicodeDecodeError/UnicodeEncodeError
  - Py3 -- none, implicit conversion between str and bytes is prohibited
- Printing / writing to a file:
  - unicode (Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
  - str (Py2) -- not applicable, raw bytes are written
  - str (Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
  - bytes (Py3) -- none, printing produces its repr() instead
First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" on Joel on Software to get the distinction.
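A minimal Python 3 sketch of that terminology, assuming a short Cyrillic sample word of my own choosing: decoding turns bytes into characters, encoding turns characters into bytes, and the same characters give different bytes under different encodings.

```python
# Python 3 sketch of the terminology: decode = bytes -> characters,
# encode = characters -> bytes.
raw = b"\xd0\x9c\xd0\xb8\xd1\x80"      # UTF-8 bytes of a 3-letter Russian word

text = raw.decode("utf-8")             # decoding: bytes -> str
round_trip = text.encode("utf-8")      # encoding: str -> bytes
other = text.encode("cp1251")          # same characters, different bytes

print(len(raw), len(text), len(other), round_trip == raw)
```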
Now...
Reading the source and parsing string literals
At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.
Python 2
Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood but this is the net effect.)
So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".
If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ascii character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are treated as Unicode literals when parsing, with all that this implies.
Python 3
Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ascii characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
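The Python 3 behaviour can be checked from Python itself. This is a sketch rather than a description of the interpreter's internals: it uses the standard library's tokenize.detect_encoding() to report the source encoding for a few byte strings, and compile() on raw bytes to show the declaration being honoured. The helper name detect and the sample sources are made up for the illustration.

```python
import io
import tokenize

def detect(source_bytes):
    """Return the source encoding Python 3 would use for these raw bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

no_declaration = detect(b'x = 1\n')
with_cookie = detect(b'# -*- coding: latin-1 -*-\nx = 1\n')
with_bom = detect(b'\xef\xbb\xbfx = 1\n')

# The byte 0xE9 is "e with acute" in Latin-1; the declaration tells the
# compiler how to decode it before any parsing happens.
namespace = {}
exec(compile(b'# -*- coding: latin-1 -*-\ns = "\xe9"\n', "<demo>", "exec"), namespace)

# A non-ASCII character inside a bytes literal is rejected at compile time.
try:
    compile('b = b"\u00e9"\n', "<demo>", "exec")
    bytes_literal_ok = True
except SyntaxError:
    bytes_literal_ok = False

print(no_declaration, with_cookie, with_bom, repr(namespace["s"]), bytes_literal_ok)
```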
Transcoding
As per the clarification at the start:
- str -- bytes => can only be decoded (directly, that is; details follow)
- unicode -- characters => can only be encoded
Python 2
In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".
Now, here's a caveat:
- a decode() and encode() -- with the default encoding -- is done implicitly when converting str<->unicode:
  - when str and unicode are mixed in one operation, e.g. string formatting (a third of the UnicodeDecodeError/UnicodeEncodeError questions on SO are about this)
  - when you encode() a str or decode() a unicode (the 2nd third of the SO questions)
Python 3
There's no "default encoding" for implicit conversions at all: converting between str and bytes implicitly is now prohibited. (As the number of SO questions from confused users testifies, it proved to be more trouble than it's worth.)
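A short Python 3 sketch of what "prohibited" looks like in practice (the sample string is arbitrary): mixing the two types raises TypeError, and str() applied to bytes does not decode anything.

```python
# Python 3 sketch: str and bytes never convert into each other implicitly.
text = "caf\u00e9"
data = text.encode("utf-8")          # explicit str -> bytes

try:
    text + data                      # mixing the two types
    mixed = "allowed"
except TypeError:
    mixed = "TypeError"

try:
    bytes(text)                      # constructor without an encoding
    ctor = "allowed"
except TypeError:
    ctor = "TypeError"

shown = str(data)                    # not a decode: yields the repr() text
decoded = data.decode("utf-8")       # explicit bytes -> str

print(mixed, ctor, shown, decoded == text)
```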
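- bytes can only be decoded and str -- encoded. The encoding argument is mandatory for the str(<bytes>, encoding) and bytes(<str>, encoding) constructors; the decode() and encode() methods use utf-8 when it is omitted.
- bytes->str without an encoding (incl. implicitly) produces its repr() instead (which is only useful for printing), avoiding the encoding issue entirely
- str->bytes without an encoding is prohibited
Printing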
This matter is unrelated to a variable's value but related to what you would see on the screen when it's printed -- and whether you will get a UnicodeEncodeError when printing.
Python 2
- A unicode is encoded with <file>.encoding if set; otherwise, it's implicitly converted to str as per the above. (The final third of the UnicodeEncodeError SO questions fall into this category.)
  - For the standard streams, the encoding can be overridden with the PYTHONIOENCODING envvar.
- str's bytes are sent to the OS stream as-is. What specific characters you will see on the screen depends on your terminal's encoding (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).
Python 3
The changes are:
- files opened in text vs binary mode natively accept str or bytes, respectively, and outright refuse to process the wrong type. Text-mode files always have an encoding set, locale.getpreferredencoding(False) being the default.
- print for text streams still implicitly converts everything to str, which in the case of bytes prints its repr() as per the above, avoiding the encoding issue altogether
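Both Python 3 points can be tried out with a scratch file. The sketch below assumes a temporary path and sample strings of my own. It writes through a text-mode and a binary-mode file object, records what each refuses, and captures what print() emits for a bytes value.

```python
import io
import locale
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # hypothetical scratch file

# Text mode: accepts str only and always carries an encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write("\u0432\u043e\u0439\u043d\u0430")          # 5 Cyrillic letters
    file_encoding = f.encoding
    try:
        f.write(b"raw")
        text_mode_takes_bytes = True
    except TypeError:
        text_mode_takes_bytes = False

# Binary mode: accepts bytes only; the file holds the encoded form.
with open(path, "rb") as f:
    stored = f.read()
with open(path, "ab") as f:
    try:
        f.write("text")
        binary_mode_takes_str = True
    except TypeError:
        binary_mode_takes_str = False

# Per the text above, the default encoding for text-mode files.
default_encoding = locale.getpreferredencoding(False)

# print() converts bytes with str(), so the stream receives the repr() text.
buffer = io.StringIO()
print(b"\xd0\xb2", file=buffer)
printed = buffer.getvalue()

print(file_encoding, len(stored), default_encoding, printed)
```

Passing encoding= explicitly to open(), as done here, avoids depending on whatever the locale default happens to be on a given machine.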