When I'm writing sysadmin scripts in Python, the buffering on sys.stdout that affects every call to print() is annoying: I don't want to wait for a buffer to fill up and then get a big chunk of lines at once on the screen. I want each line of output to appear as soon as the script generates it, and I don't even want to wait for a newline to see the output.
An often-used idiom to do this in Python is
import os
import sys
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)  # reopen stdout's fd binary and unbuffered
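(There are alternatives, such as starting the interpreter with python -u or flushing explicitly after every print. A minimal sketch of the latter; the helper name uprint is my own invention:
from __future__ import print_function
import sys

def uprint(*args, **kwargs):
    # print, then push the buffer out immediately
    print(*args, **kwargs)
    sys.stdout.flush()
But the fdopen() idiom above is what I have always used.)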
This worked fine for me for a long time. Now I noticed that it doesn't work with Unicode. Please see the following script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import os
import sys
print('Original encoding: {}'.format(sys.stdout.encoding))
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)  # the zero-buffering line
print('New encoding: {}'.format(sys.stdout.encoding))
text = b'Eisb\xe4r'  # Latin-1 encoded bytes
print(type(text))
print(text)
text = text.decode('latin-1')
print(type(text))
print(text)
This leads to the following output:
Original encoding: UTF-8
New encoding: None
<type 'str'>
Eisb▒r
<type 'unicode'>
Traceback (most recent call last):
  File "./export_debug.py", line 18, in <module>
    print(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 4: ordinal not in range(128)
It took me hours to track down the cause (my original script is much longer than this minimal debugging script). The culprit is the line
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
which I have used for years, so I didn't expect any problem with it. Just comment out this line and you get the correct output:
Original encoding: UTF-8
New encoding: UTF-8
<type 'str'>
Eisb▒r
<type 'unicode'>
Eisbär
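My understanding of why it fails (an assumption on my part, pieced together from the docs): os.fdopen() returns a plain binary file object whose encoding attribute is None, so when a unicode object is written to it, Python 2 falls back to the interpreter-wide default codec from sys.getdefaultencoding(), which is 'ascii'. The same error can be reproduced without touching stdout at all:
import sys
print(sys.getdefaultencoding())   # 'ascii' on Python 2
u'Eisb\xe4r'.encode('ascii')      # raises the same UnicodeEncodeError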
So what is the script meant to do? To keep my Python 2.7 code as close to Python 3.x as possible, I always use
from __future__ import print_function, unicode_literals
which makes Python use the new print() function, but more importantly, it makes Python treat every bare string literal as Unicode by default. I have a lot of Latin-1 / ISO-8859-1 encoded data, for example
text = b'Eisb\xe4r'
To work with it the intended way, I need to decode it to Unicode first; that's what
text = text.decode('latin-1')
is for. As the default encoding is UTF-8 on my system, whenever I print a string, Python encodes the internal Unicode string to UTF-8 at that point. But first it has to be proper Unicode internally.
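Spelled out, the round trip I rely on looks like this (a small sketch; I checked the byte values interactively):
text = b'Eisb\xe4r'            # Latin-1 bytes: 0xe4 is 'ä'
text = text.decode('latin-1')  # u'Eisb\xe4r', a unicode object
data = text.encode('utf-8')    # b'Eisb\xc3\xa4r', what print() sends to a UTF-8 terminal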
Now all of that works fine in general, just not with a zero-byte output buffer so far. Any ideas? I noticed that sys.stdout.encoding is None after the zero-buffering line, but I don't know how to set it again: it is a read-only attribute, and the environment variables LC_ALL and LC_CTYPE seem to be evaluated only when the Python interpreter starts.
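The only direction I can think of is wrapping the unbuffered binary stream in a codecs stream writer. A sketch of the idea; I have not verified that it preserves the zero buffering:
import codecs
import os
import sys

raw = os.fdopen(sys.stdout.fileno(), 'wb', 0)  # unbuffered binary stream
sys.stdout = codecs.getwriter('utf-8')(raw)    # encode unicode to UTF-8 on every write
Is this the right way, or is there a cleaner one?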
Btw.: 'Eisbär' is the German word for 'polar bear'.