Windows console encoding

Posted 2019-07-31 13:28

Question:

What is the default console encoding on Windows? It seems like sometimes it is the ANSI encoding (CP-1252), sometimes it is the OEM encoding (CP-850 for Western Europe by default) given by the chcp command.

  • Command-line arguments and environment variables trigger the ANSI encoding (é = 0xe9):

    > chcp 850
    Active code page: 850
    > python -c "print 'é'"
    Ú
    > python -c "print '\x82'"
    é
    > python -c "print '\xe9'"
    Ú
    > $env:foobar="é"; python -c "import os; print os.getenv('foobar')"
    Ú
    
    > chcp 1252
    Active code page: 1252
    > python -c "print 'é'"
    é
    > python -c "print '\x82'"
    ,
    > python -c "print '\xe9'"
    é
    > $env:foobar="é"; python -c "import os; print os.getenv('foobar')"
    é
    
  • Python console and standard input trigger the OEM encoding (é = 0x82 if the OEM encoding is CP-850, é = 0xe9 if the OEM encoding is CP-1252):

    > chcp 850
    Active code page: 850
    > python
    >>> print 'é'
    é
    >>> print '\x82'
    é
    >>> print '\xe9'
    Ú
    > python -c "print raw_input()"
    é
    é
    
    > chcp 1252
    Active code page: 1252
    > python
    >>> print 'é'
    é
    >>> print '\x82'
    ,
    >>> print '\xe9'
    é
    > python -c "print raw_input()"
    é
    é
    

Note: In these examples, I used PowerShell 5.1 and CPython 2.7.14 on Windows 10.

Answer 1:

First of all, what matters for all your non-ASCII characters is your console encoding and your Windows locale settings. You are using byte strings, and Python just prints out the bytes it received. Your keyboard input is encoded to a specific byte or byte sequence by the console before those bytes are passed on to Python. To Python, this is all just opaque data (numbers in the range 0-255), and print passes those bytes back to the console exactly as Python received them.
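As a rough illustration of that first step, the same character becomes different bytes depending on which codec the console applies. This is a sketch in Python 3 syntax (where str is Unicode and byte strings need a b prefix), not a replay of the Python 2.7 session in the question; the helper name console_bytes is made up for this example.

```python
def console_bytes(char, codepage):
    """Return the bytes a console using `codepage` would hand to a program."""
    return char.encode(codepage)

# "\u00e9" is é, written as an escape so the source file encoding is irrelevant.
for cp in ("cp850", "cp1252", "utf-8"):
    print(cp, console_bytes("\u00e9", cp))
# cp850 b'\x82'
# cp1252 b'\xe9'
# utf-8 b'\xc3\xa9'
```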

In PowerShell, the encoding used for the bytes sent to Python via command-line arguments is determined not by the chcp codepage, but by the "Language for non-Unicode programs" setting in the Control Panel (search for Region, then open the Administrative tab). It is this setting that encodes é to 0xE9 before passing it to Python as a command-line argument. A large number of Windows codepages use 0xE9 for é (and there is no single encoding called "ANSI"; the name refers to whichever codepage this setting selects).

The same applies to environment variables. Python refers to the encoding Windows uses here as the MBCS codec; you can decode command-line parameters or environment variables to Unicode using the 'mbcs' codec, which uses the MultiByteToWideChar() and WideCharToMultiByte() Windows API functions with the CP_ACP code page identifier.
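A minimal sketch of that decoding step, in Python 3 syntax. The 'mbcs' codec exists only on Windows, so the hypothetical helper decode_ansi below falls back to cp1252 elsewhere; that fallback is an assumption made for illustration, matching the Western European setup in the question.

```python
import sys

def decode_ansi(raw, codec=None):
    """Decode bytes that Windows encoded with the ANSI (CP_ACP) codepage."""
    if codec is None:
        # 'mbcs' is only registered on Windows; cp1252 is an assumed stand-in.
        codec = "mbcs" if sys.platform == "win32" else "cp1252"
    return raw.decode(codec)

# 0xE9 is what the question's command-line argument contained for é.
# ascii() keeps the output printable on any console.
print(ascii(decode_ansi(b"\xe9", codec="cp1252")))  # '\xe9', i.e. U+00E9 (é)
```

In Python 2 the equivalent call would be along the lines of sys.argv[1].decode('mbcs').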

When using the interactive prompt, Python is passed bytes as encoded by the PowerShell console codepage, set with chcp. For you that's codepage 850, and a byte with the hex value 0x82 is received when you type é. Because print sends the same 0x82 byte back to the same console, the console then translates 0x82 back to an é character on the screen.
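The two behaviours in the question can be modelled as a round trip: one codepage turns the character into bytes on the way in, and the console codepage turns the bytes back into a character on the way out. The sketch below is in Python 3 syntax and the function name echo is hypothetical; it only simulates the byte flow and does not talk to a real console.

```python
def echo(char, input_codepage, output_codepage):
    """Encode `char` on the way in, pass the bytes through, decode on the way out."""
    raw = char.encode(input_codepage)   # what Python receives as opaque bytes
    return raw.decode(output_codepage)  # what the console displays

e_acute = "\u00e9"  # é

# Command-line argument: ANSI (cp1252) in, console set to chcp 850 out.
print(ascii(echo(e_acute, "cp1252", "cp850")))  # '\xda', i.e. Ú
# Interactive prompt: chcp 850 in and out, so the character survives.
print(ascii(echo(e_acute, "cp850", "cp850")))   # '\xe9', i.e. é
```

With matching codepages the bytes round-trip cleanly; a mismatch yields the Ú seen in the question.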

Only when you use Unicode text (with a unicode string literal like u'é') does Python do any decoding and encoding of the data. print writes to sys.stdout, which is configured to encode Unicode data using the codec of the current locale (or PYTHONIOENCODING if set). So print u'é' hands that unicode object to sys.stdout, which encodes it to bytes using the configured codec and then writes those bytes to the console.
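To make the encoding step visible, the following sketch wraps an in-memory byte buffer in a text stream, which is roughly the role sys.stdout plays. It uses Python 3's io module, and the helper name bytes_written is an invention for this example; a real console stream has more moving parts.

```python
import io

def bytes_written(text, encoding):
    """Return the bytes a text stream configured with `encoding` emits for `text`."""
    raw = io.BytesIO()
    # newline="\n" disables newline translation, so only the codec is at work.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # release the buffer without closing it
    return data

for cp in ("cp850", "cp1252", "utf-8"):
    print(cp, bytes_written("\u00e9", cp))
# cp850 b'\x82'
# cp1252 b'\xe9'
# utf-8 b'\xc3\xa9'
```

Setting PYTHONIOENCODING changes which codec the real sys.stdout is given, which corresponds to the encoding argument here.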

To produce the unicode object from the u'é' source code text (itself a sequence of bytes), Python does have to decode the source code given. For the -c command line, the bytes that are passed in are decoded as Latin-1. In the interactive console, the locale is used. So python -c "print u'é'" and print u'é' in the interactive session will result in different output.

It should be noted that Python 3 uses Unicode strings throughout, and command-line parameters and environment variables are loaded into Python with the Windows 'wide' APIs to access the data as UTF-16, then presented as Unicode string objects. You can still access console data and filesystem information as byte strings, but as of Python 3.6, accessing the filesystem and stdin/stdout/stderr streams as binary uses UTF-8 encoded data (again using the 'wide' APIs).
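If you want to check which codecs a given Python 3 interpreter has actually picked, a small inspection helper along these lines may be useful. The function name describe_encodings is hypothetical, and the values it reports depend on the platform, the Python version, and environment variables such as PYTHONIOENCODING, so no particular output is shown.

```python
import locale
import sys

def describe_encodings():
    """Collect the encodings this interpreter uses for I/O and the filesystem."""
    return {
        # getattr guards against replaced or missing standard streams.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }

for name, value in sorted(describe_encodings().items()):
    print(name, value)
```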