We've already gotten our code base running under Python 2.6. In order to prepare for Python 3.0, we've started adding
from __future__ import unicode_literals
to our .py files (as we modify them). I'm wondering if anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugging).
The main source of problems I've had working with unicode strings is when you mix utf-8 encoded strings with unicode ones.
For example, consider the following two scripts, two.py and one.py.
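As a rough single-file sketch of the situation these two scripts set up: two_name and one_name stand in for two.name and one.name, and the string values are made up for illustration. It is written so that it runs on both Python 2.6+ and Python 3, which is why it catches two exception types.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Stand-in for two.name: a UTF-8 encoded byte string, which is what a module
# without unicode_literals produces on Python 2 (hypothetical value).
two_name = 'helló wörld from two'.encode('utf-8')

# Stand-in for one.name: a unicode string (hypothetical value).
one_name = 'helló wörld from one'

try:
    combined = one_name + two_name
except (UnicodeDecodeError, TypeError) as exc:
    # Python 2: the implicit ASCII decode of two_name fails (UnicodeDecodeError).
    # Python 3: text and bytes are never mixed implicitly (TypeError).
    failure = type(exc).__name__
    # Decoding explicitly with the right codec makes the concatenation work.
    combined = one_name + two_name.decode('utf-8')
```

Either way the lesson is the same: decode byte strings explicitly before combining them with unicode.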
Running python one.py fails with a UnicodeDecodeError.
In this example, two.name is a UTF-8 encoded byte string (not unicode), since two.py did not import unicode_literals, and one.name is a unicode string. When you mix the two, Python tries to decode the encoded string (assuming it's ASCII) to convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').
The same thing can happen if you encode a string and try to mix it with other strings later. For example, encoding html to UTF-8 and then running print 'DEBUG: %s' % html works without unicode_literals, but after adding the unicode_literals import it does NOT.
It fails because 'DEBUG: %s' is now a unicode string, and therefore Python tries to decode html. A couple of ways to fix the print are either print str('DEBUG: %s') % html or print 'DEBUG: %s' % html.decode('utf-8').
I hope this helps you understand the potential gotchas when using unicode strings.
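One way to make the second fix systematic is to decode byte strings at the point where they meet unicode text. The sketch below assumes UTF-8 input; to_text is a hypothetical helper name and the html value is made up. It should behave the same on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings explicitly."""
    # On Python 2.6+, bytes is an alias for str, so this catches encoded strings.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# An already-encoded byte string, as in the DEBUG example (hypothetical value).
html = '<html><body>helló wörld</body></html>'.encode('utf-8')

# Formatting is safe because both operands are unicode by the time they mix.
message = 'DEBUG: %s' % to_text(html)
```

The helper leaves unicode input untouched, so it can be applied without first checking what kind of string a caller passed in.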
Also, in Python 2.6 (before 2.6.5 RC1), unicode literals don't play nicely with keyword arguments (issue4978). Code that unpacks a dict into keyword arguments, for example, works without unicode_literals but fails with TypeError: keywords must be string if unicode_literals is used.
There are more.
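For the keyword-argument problem (issue4978), one possible workaround on affected 2.6 releases is to convert the dict keys to native str before unpacking. This is only a minimal sketch; foo and native_keys are hypothetical names used for illustration.

```python
from __future__ import unicode_literals


def foo(a=None):
    """Toy function that takes one keyword argument."""
    return a


def native_keys(kwargs):
    """Return a copy of kwargs whose keys are native str objects."""
    # str() yields a byte string on Python 2 and a text string on Python 3,
    # which is the key type each interpreter accepts for **kwargs.
    return dict((str(key), value) for key, value in kwargs.items())


options = {'a': 1}  # the key is a unicode string under unicode_literals
result = foo(**native_keys(options))
```

This only works for ASCII keys, which is not much of a restriction since Python 2 identifiers are ASCII anyway.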
There are libraries and builtins that expect byte strings and don't tolerate unicode.
Two examples:
builtin: creating a class by calling type() (slightly esoteric) doesn't work with unicode_literals, because type() expects a str for the class name.
library: sending a message through wx pubsub doesn't work, because the wx pubsub library expects a str message type.
The former is esoteric and easily fixed by passing a byte string as the name, but the latter is devastating if your code is full of calls to pub.sendMessage() (which mine is).
Dang it, eh?!?
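For the type() case, a minimal sketch of the fix is to coerce the class name (and, to be safe, the attribute names) to native str. make_class and the Enum members are hypothetical, chosen only for illustration.

```python
from __future__ import unicode_literals


def make_class(name, members):
    """Create a class with type(), coercing all names to native str."""
    native_members = dict((str(key), value) for key, value in members.items())
    # str(name) is a byte string on Python 2 and a text string on Python 3.
    return type(str(name), (), native_members)


Enum = make_class('Enum', {'RED': 0, 'GREEN': 1})
```

The same str(...) wrapping could presumably be applied to the message type passed to pub.sendMessage(), though I have not verified that against wx.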
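I did find that if you add the unicode_literals directive, you should also add an encoding declaration, something like # -*- coding: utf-8 -*-, to the first or second line of your .py file. Otherwise, lines containing non-ASCII string literals result in a SyntaxError complaining that no encoding was declared.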
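A sketch of what the top of such a file could look like, assuming the file itself is saved as UTF-8. The variable name and its value are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# With the declaration above, Python 2 accepts the non-ASCII literal, and
# unicode_literals makes it a unicode string rather than UTF-8 bytes.
greeting = 'barré'

# Five characters, not six bytes, because the literal is unicode.
length = len(greeting)
```

The declaration has to match the encoding the editor actually uses to save the file, otherwise the literal is decoded incorrectly.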
Also take into account that unicode_literals will affect eval() but not repr() (an asymmetric behavior which imho is a bug), i.e. eval(repr(b'\xa4')) won't be equal to b'\xa4' (as it would with Python 3).
Ideally, eval(repr(x)) == x would be an invariant that always holds, for both b'\xa4' and '\xa4', for all combinations of unicode_literals and Python {2.7, 3.x} usage. The second assertion, the one for '\xa4', happens to work, since repr('\xa4') evaluates to u'\xa4' in Python 2.7.
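A sketch of that round-trip check, written so it reports the result instead of raising. roundtrips is a hypothetical helper name.

```python
from __future__ import unicode_literals


def roundtrips(value):
    """Return True if eval(repr(value)) reproduces value."""
    # eval() compiles its argument with this module's __future__ flags, so
    # under unicode_literals an unprefixed literal comes back as unicode.
    return eval(repr(value)) == value


# On Python 3 both checks are True. On Python 2.7 with unicode_literals,
# the byte-string check is the one described above as failing.
byte_ok = roundtrips(b'\xa4')
text_ok = roundtrips('\xa4')
```

Running this under each interpreter, with and without the __future__ import, gives a quick picture of where the asymmetry shows up.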