UTF-8 error with Python and gettext

Posted 2019-03-29 04:31

Question:

I use UTF-8 in my editor, so all strings shown here are stored as UTF-8 in the files.

I have a python script like this:

# -*- coding: utf-8 -*-
...
parser = optparse.OptionParser(
  description=_('automates the dice rolling in the classic game "risk"'), 
  usage=_("usage: %prog attacking defending"))

Then I used xgettext to get everything out and got a .pot file which can be boiled down to:

"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: auto_dice.py:16
msgid "automates the dice rolling in the classic game \"risk\""
msgstr ""

After that, I used msginit to get a de.po which I filled in like this:

"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#: auto_dice.py:16
msgid "automates the dice rolling in the classic game \"risk\""
msgstr "automatisiert das Würfeln bei \"Risiko\""

Running the script, I get the following error:

  File "/usr/lib/python2.6/optparse.py", line 1664, in print_help
    file.write(self.format_help().encode(encoding, "replace"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 60: ordinal not in range(128)

How can I fix that?

Answer 1:

That error means encode was called on a byte string. Python 2 first tries to decode the byte string to Unicode using the system default encoding (ascii on Python 2), and only then re-encodes it with the encoding you specified. The implicit decode is the step that fails.

Generally, the way to resolve it is to call s.decode('utf-8') (or whatever encoding the strings are in) before trying to use the strings. It might also work if you just use unicode literals: u'automates...' (that depends on how strings are substituted from .po files, which I don't know about).
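A minimal sketch of the failure and the fix, written so that it runs on Python 2 and 3. The sample string is taken from the de.po in the question, and the explicit decode('ascii') only imitates the implicit conversion Python 2 performs inside encode:

```python
# -*- coding: utf-8 -*-
# Bytes as they would come out of a UTF-8 catalog (sample text from de.po).
raw = u'automatisiert das W\u00fcrfeln'.encode('utf-8')

# Python 2 performs this step implicitly when encode() is called on a byte string.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The fix: decode with the real encoding first, then encode for output.
text = raw.decode('utf-8')
output = text.encode('ascii', 'replace')
```

After the explicit decode, encode no longer raises; characters the target encoding cannot represent are replaced instead.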

This sort of confusing behaviour is improved in Python 3, which won't try to convert bytes to unicode unless you specifically tell it to.
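For comparison, a short sketch of the Python 3 behaviour (this one runs on Python 3 only, and the sample bytes are illustrative):

```python
raw = 'W\u00fcrfeln'.encode('utf-8')  # a bytes object

# bytes has no encode() in Python 3, so no implicit ASCII decode can happen.
has_encode = hasattr(raw, 'encode')

# Mixing bytes and str fails immediately instead of guessing an encoding.
try:
    'prefix: ' + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Conversion happens only when it is requested explicitly.
text = raw.decode('utf-8')
```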



Answer 2:

My suspicion is that the problem is caused by _("string") returning a byte string and not a Unicode string.

The obvious workaround is this:

parser = optparse.OptionParser(
        description=_('automates the dice rolling in the classic game "risk"').decode('utf-8'),
        usage=_("usage: %prog attacking defending").decode('utf-8'))

But that feels wrong.
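If the decoding cannot be avoided, one option is to do it in a single place rather than at every call site. A minimal sketch, assuming the catalog is UTF-8; make_unicode_gettext and fake_gettext are hypothetical names, and the stand-in function only imitates a _ that returns byte strings:

```python
def make_unicode_gettext(gettext_func, encoding='utf-8'):
    """Wrap a gettext-style function so that it always returns text."""
    def wrapper(message):
        result = gettext_func(message)
        if isinstance(result, bytes):  # bytes is an alias of str on Python 2
            result = result.decode(encoding)
        return result
    return wrapper


def fake_gettext(message):
    # Hypothetical stand-in for a _() that returns UTF-8 byte strings.
    return u'automatisiert das W\u00fcrfeln bei "Risiko"'.encode('utf-8')


_ = make_unicode_gettext(fake_gettext)
description = _('automates the dice rolling in the classic game "risk"')
```

The call sites then stay as plain _('...') and the encoding assumption lives in one function.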

Using ugettext, or calling install with unicode=True, may help.

The Python gettext docs give these examples:

import gettext
t = gettext.translation('spam', '/usr/share/locale')
_ = t.ugettext

or:

import gettext
gettext.install('myapplication', '/usr/share/locale', unicode=1)
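A sketch of the first variant that can be run without a compiled catalog. With fallback=True, gettext.translation returns a NullTranslations object when no .mo file is found, and the getattr picks ugettext on Python 2 and plain gettext on Python 3, where ugettext does not exist. The domain name auto_dice and the empty temporary directory are assumptions for illustration:

```python
import gettext
import tempfile

# Empty directory, so no catalog is found and the fallback is used.
localedir = tempfile.mkdtemp()

t = gettext.translation('auto_dice', localedir, fallback=True)

# Prefer the unicode-returning method where it exists (Python 2).
_ = getattr(t, 'ugettext', t.gettext)

# Without a catalog the message comes back untranslated.
message = _(u'usage: %prog attacking defending')
```

With a real catalog in place, the same lookup should hand back the translated text as a Unicode string rather than raw bytes.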

I am trying to reproduce your problem, and even if I use install(unicode=1), I get back a byte string (str type).

Either I am using gettext incorrectly, or I am missing a character coding declaration in my .po/.mo file.

I will update when I know more.

# Python 2: check which string type _() returns for a translated message.
xlt = _('automates the dice rolling in the classic game "risk"')
print type(xlt)
if isinstance(xlt, str):
    print 'gettext returned a str (wrong)'
    print xlt
    print xlt.decode('utf-8').encode('utf-8')
elif isinstance(xlt, unicode):
    print 'gettext returned a unicode (right)'
    print xlt.encode('utf-8')

(One other possibility is to use escapes or Unicode code points in the .po file, but that doesn't sound like fun.)

(Or you could look at your system's .po files to see how they handle non-ASCII characters.)



Answer 3:

I'm not familiar with this, but it appears to be a known bug in Python 2.6 that has been fixed in 2.7:

http://bugs.python.org/issue2931

If it's not feasible for you to use 2.7, try this workaround:

http://mail.python.org/pipermail/python-dev/2006-May/065458.html