UTF-8 In Python logging, how?

Posted 2019-01-21 21:57

I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example:

import logging

def logging_test():
    handler = logging.FileHandler("/home/ted/logfile.txt", "w",
                                  encoding = "UTF-8")
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Explode
    root_logger.info(unicode_string)

if __name__ == "__main__":
    logging_test()

This explodes with UnicodeDecodeError on the logging.info() call.

At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this:

file_handler.write(unicode_string.encode("UTF-8"))

When it should be doing this:

file_handler.write(unicode_string)

Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation.

4 Answers
叼着烟拽天下
#2 · 2019-01-21 22:41

Code like this:

raise Exception(u'щ')

caused the following error:

  File "/usr/lib/python2.7/logging/__init__.py", line 467, in format
    s = self._fmt % record.__dict__
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

This happens because the format string is a byte string, while some of the format string arguments are unicode strings with non-ASCII characters:

>>> "%(message)s" % {'message': Exception(u'\u0449')}
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u0449' in position 0: ordinal not in range(128)

Making the format string unicode fixes the issue:

>>> u"%(message)s" % {'message': Exception(u'\u0449')}
u'\u0449'

So, in your logging configuration, make all format strings unicode:

'formatters': {
    'simple': {
        'format': u'%(asctime)-s %(levelname)s [%(name)s]: %(message)s',
        'datefmt': '%Y-%m-%d %H:%M:%S',
    },
 ...
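
As a rough illustration of the fragment above, here is a minimal, self-contained sketch of a complete configuration passed to logging.config.dictConfig. The logger name "unicode_demo", the handler name "file", and the temporary log path are assumptions made up for this example, and %(asctime)s is left out only to keep the output deterministic. It is written to run on Python 3, where every str is already unicode; the u prefix is kept to mirror the advice above.

```python
import io
import logging
import logging.config
import os
import tempfile

# Hypothetical log location, chosen only for this sketch.
log_path = os.path.join(tempfile.mkdtemp(), "logfile.txt")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {
            # Unicode format string, as recommended above.
            "format": u"%(levelname)s [%(name)s]: %(message)s",
        },
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": log_path,
            "mode": "w",
            "encoding": "utf-8",  # the handler encodes text on write
            "formatter": "simple",
        },
    },
    "loggers": {
        # Hypothetical logger name; propagate is off so only this handler runs.
        "unicode_demo": {"handlers": ["file"], "level": "INFO", "propagate": False},
    },
}

logging.config.dictConfig(LOGGING)
logger = logging.getLogger("unicode_demo")
logger.info(u"\u0449")

# Close the handler so the file contents are flushed to disk.
for handler in list(logger.handlers):
    handler.close()
    logger.removeHandler(handler)

with io.open(log_path, encoding="utf-8") as f:
    logged = f.read()
```

After this runs, logged holds the formatted line read back from the file as text.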

And patch the default logging formatter to use a unicode format string:

logging._defaultFormatter = logging.Formatter(u"%(message)s")
混吃等死
#3 · 2019-01-21 22:44

Check that you have the latest Python 2.6: some Unicode bugs have been found and fixed since 2.6 came out. For example, on my Ubuntu Jaunty system, I ran your script, copied and pasted, removing only the '/home/ted/' prefix from the log file name. Result (copied and pasted from a terminal window):

vinay@eta-jaunty:~/projects/scratch$ python --version
Python 2.6.2
vinay@eta-jaunty:~/projects/scratch$ python utest.py 
printed unicode object: ô
vinay@eta-jaunty:~/projects/scratch$ cat logfile.txt 
ô
vinay@eta-jaunty:~/projects/scratch$ 

On a Windows box:

C:\temp>python --version
Python 2.6.2

C:\temp>python utest.py
printed unicode object: ô

And the contents of the file:

[image: contents of the log file; not preserved]

This might also explain why Lennart Regebro couldn't reproduce it either.

贪生不怕死
#4 · 2019-01-21 22:44

If I understood your problem correctly, the same issue should arise on your system when you do just:

str(u'ô')

I guess automatic encoding to the locale encoding on Unix will not work until you enable the locale-aware if branch (the one that uses the locale module) in the setencoding function of your site module. This file usually resides in /usr/lib/python2.x and is worth inspecting anyway. AFAIK, locale-aware setencoding is disabled by default (that is the case for my Python 2.6 installation).

The choices are:

  • Let the system figure out the right way to encode Unicode strings to bytes or do it in your code (some configuration in site-specific site.py is needed)
  • Encode Unicode strings in your code and output just bytes
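
The second choice can be sketched roughly as follows. This is only an illustration, not code from the answer: the helper name write_utf8_line and the temporary output path are hypothetical. The text is encoded explicitly and written to a file opened in binary mode, so no default encoding is ever consulted.

```python
import os
import tempfile


def write_utf8_line(path, text):
    """Encode text as UTF-8 and append it, plus a newline, to a binary file.

    Hypothetical helper for this sketch; returns the bytes that were written.
    """
    data = (text + u"\n").encode("utf-8")
    with open(path, "ab") as f:  # binary mode: we hand over bytes ourselves
        f.write(data)
    return data


# Hypothetical output location, chosen only for this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), "out.txt")
written = write_utf8_line(demo_path, u"\u00f4")  # the "o with a hat" from the question
```

The bytes written for the character are '\xc3\xb4', the same byte string used in the question.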

See also The Illusive setdefaultencoding by Ian Bicking and related links.

Emotional °昔
#5 · 2019-01-21 22:49

Try this:

import logging

def logging_test():
    log = open("./logfile.txt", "w")
    handler = logging.StreamHandler(log)
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Encode to UTF-8 bytes before logging, so this no longer explodes
    root_logger.info(unicode_string.encode("utf8", "replace"))


if __name__ == "__main__":
    logging_test()

For what it's worth, I was expecting to have to use codecs.open to open the file with UTF-8 encoding, but either that's the default or something else is going on here, since it works as is.
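
For comparison, here is a hedged sketch of the other direction: instead of encoding each message by hand, open the stream with an explicit encoding and log the unicode string unchanged. It is written for Python 3 (io.open also exists on 2.6+, but StreamHandler behaves differently there, so treat that as untested). The logger name "stream_demo" and the temporary path are assumptions for this example.

```python
import io
import logging
import os
import tempfile

# Hypothetical log location, chosen only for this sketch.
log_path = os.path.join(tempfile.mkdtemp(), "logfile.txt")

# A text stream that encodes to UTF-8 on write; newline="\n" disables
# platform newline translation so the bytes on disk are predictable.
stream = io.open(log_path, "w", encoding="utf-8", newline="\n")
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(u"%(message)s"))

logger = logging.getLogger("stream_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Log the unicode string itself; the stream does the encoding.
logger.info(u"\u00f4")

# Detach the handler and close the stream so the data reaches the disk.
logger.removeHandler(handler)
handler.flush()
stream.close()

with io.open(log_path, "rb") as f:
    raw = f.read()
```

Here raw contains the UTF-8 bytes of the message followed by a newline, without any call to encode in the logging code.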
