First I change the Windows CMD encoding to UTF-8 and run the Python interpreter:
chcp 65001
python
Then I try to print a unicode string inside it, and when I do this Python crashes in a peculiar way (I just get a cmd prompt in the same window).
>>> import sys
>>> print u'ëèæîð'.encode(sys.stdin.encoding)
Any ideas why it happens and how to make it work?
UPD: sys.stdin.encoding returns 'cp65001'.
UPD2: It just came to me that the issue might be connected with the fact that utf-8 uses a multi-byte character set (kcwu made a good point on that). I tried running the whole example with 'windows-1250' and got 'ëeaî?'. Windows-1250 uses a single-byte character set, so it worked for the characters it understands. However, I still have no idea how to make 'utf-8' work here.
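The single-byte vs. multi-byte point can be seen directly by comparing encoded lengths (a small illustration, not from the original post):

```python
# -*- coding: utf-8 -*-
s = u'ëèæîð'

# Each of these five characters occupies one byte in a single-byte
# code page (unsupported ones become '?' with errors='replace')...
single = s.encode('windows-1250', 'replace')
print(len(single))   # 5 bytes: one per character

# ...but each needs two bytes in UTF-8, since they are all >= U+0080.
multi = s.encode('utf-8')
print(len(multi))    # 10 bytes: two per character
```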
UPD3: Oh, I found out it is a known Python bug. I guess what happens is that Python copies the cmd encoding as 'cp65001' to sys.stdin.encoding and tries to apply it to all the input. Since it fails to understand 'cp65001', it crashes on any input that contains non-ASCII characters.
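The failure mode described above can be checked with codecs.lookup: on the Python 2 interpreters discussed here, 'cp65001' raised LookupError, while modern Python 3 (3.3+) knows the name. A small check, not from the original post:

```python
import codecs

# On Python 3.3+ this succeeds and behaves like UTF-8; on the Python 2
# interpreters discussed above it raised LookupError, which is why any
# non-ASCII console input crashed.
try:
    codecs.lookup('cp65001')
    print(u'é'.encode('cp65001'))   # b'\xc3\xa9', same bytes as UTF-8
except LookupError:
    print("'cp65001' is not a known encoding on this interpreter")
```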
For me, setting this env var before execution of the Python program worked:
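The answer's actual command did not survive here; presumably the variable is PYTHONIOENCODING, which another answer below names explicitly. A sketch of its effect, launching a child interpreter with the variable set:

```python
import os
import subprocess
import sys

# PYTHONIOENCODING overrides the encoding of sys.stdin/stdout/stderr
# in the child interpreter, regardless of the console code page.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env)
print(out.decode('ascii').strip())   # utf-8
```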
Here's how to alias cp65001 to UTF-8 without changing encodings\aliases.py:

(IMHO, don't pay any attention to the silliness about cp65001 not being identical to UTF-8 at http://bugs.python.org/issue6058#msg97731 . It's intended to be the same, even if Microsoft's codec has some minor bugs.)

Here is some code (written for Tahoe-LAFS, tahoe-lafs.org) that makes console output work regardless of the chcp code page, and also reads Unicode command-line arguments. Credit to Michael Kaplan for the idea behind this solution. If stdout or stderr are redirected, it will output UTF-8. If you want a Byte Order Mark, you'll need to write it explicitly.

[Edit: This version uses WriteConsoleW instead of the _O_U8TEXT flag in the MSVC runtime library, which is buggy. WriteConsoleW is also buggy relative to the MS documentation, but less so.]

Finally, it is possible to grant ΤΖΩΤΖΙΟΥ's wish to use DejaVu Sans Mono, which I agree is an excellent font, for the console.
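The aliasing trick mentioned above presumably registers a codec search function at startup; a minimal sketch of that idea (modern Python 3.3+ already knows cp65001, so a shim like this is only needed on the old interpreters discussed here):

```python
import codecs

def cp65001_search(name):
    # Resolve the Windows UTF-8 code-page name to the built-in codec,
    # without touching encodings\aliases.py on disk.
    if name == 'cp65001':
        return codecs.lookup('utf-8')
    return None

codecs.register(cp65001_search)

print(u'ëèæîð'.encode('cp65001'))   # same bytes as .encode('utf-8')
```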
You can find information on the font requirements and how to add new fonts for the Windows console in the 'Necessary criteria for fonts to be available in a command window' Microsoft KB article.

But basically, on Vista (probably also Win7):

- Under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont, set "0" to "DejaVu Sans Mono";
- Under HKEY_CURRENT_USER\Console, set "FaceName" to "DejaVu Sans Mono".

On XP, check the thread 'Changing Command Prompt fonts?' in the LockerGnome forums.
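For completeness, the two registry edits above could be captured in a .reg file (a sketch of the values described, not an official Microsoft snippet):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont]
"0"="DejaVu Sans Mono"

[HKEY_CURRENT_USER\Console]
"FaceName"="DejaVu Sans Mono"
```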
Set the PYTHONIOENCODING environment variable:

The source of example.py is simple:

I had this annoying issue, too, and I hated not being able to run my unicode-aware scripts the same in MS Windows as in Linux. So, I managed to come up with a workaround.
Take this script (say, uniconsole.py in your site-packages or whatever):

This seems to work around the Python bug (or the win32 unicode console bug, whatever). Then I added, in all related scripts:
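Neither the uniconsole.py script nor the import line added to each script survives here; what follows is only a guess at the shape of such a workaround, not the author's code. The idea is to replace sys.stdout/sys.stderr with writers that encode with errors='replace', so unsupported characters degrade to '?' instead of crashing. Demonstrated against an in-memory buffer, with 'ascii' as a stand-in target encoding:

```python
import codecs
import io

def wrap_stream(byte_stream, encoding):
    # A writer that encodes unicode text itself, replacing anything the
    # target encoding cannot represent instead of raising an error.
    return codecs.getwriter(encoding)(byte_stream, errors='replace')

# In a real uniconsole-style module you would do something like:
#   sys.stdout = wrap_stream(sys.stdout.buffer, sys.stdout.encoding or 'utf-8')
# Demo with an in-memory buffer instead of the console:
buf = io.BytesIO()
out = wrap_stream(buf, 'ascii')
out.write(u'ëèæîð\n')
print(buf.getvalue())   # b'?????\n'
```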
Finally, I just run my scripts as needed in a console where chcp 65001 is run and the font is Lucida Console. (How I wish that DejaVu Sans Mono could be used instead… but hacking the registry and selecting it as a console font reverts to a bitmap font.)

This is a quick-and-dirty stdout and stderr replacement, and it also does not handle any raw_input-related bugs (obviously, since it doesn't touch sys.stdin at all). And, by the way, I've added the cp65001 alias for utf_8 in the encodings\aliases.py file of the standard lib.

This is because the "code page" of cmd is different from the "mbcs" of the system. Although you changed the "code page", Python (actually, Windows) still thinks your "mbcs" hasn't changed.
A few comments: you probably misspelled encodig and .code. Here is my run of your example.

The conclusion: cp65001 is not a known encoding for Python. Try 'UTF-16' or something similar.