The following code works in Python 3:
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
And produces the following output:
Nicholas Gyeney, André
Writers: Nicholas Gyeney, André
In Python 2.7, though, I get the following error:
Traceback (most recent call last):
  File "python", line 4, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 21: ordinal not in range(128)
I can fix this error by changing ", ".join(people) to ", ".join(people).encode('utf-8'), but if I do so, the output in Python 3 changes to:
b'Nicholas Gyeney, Andr\xc3\xa9'
Writers: b'Nicholas Gyeney, Andr\xc3\xa9'
So I tried to use the following code:
import sys
if sys.version_info < (3, 0):
    reload(sys)
    sys.setdefaultencoding('utf-8')
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
This makes my code work in all versions of Python, but I read that using setdefaultencoding is discouraged.
What's the best approach to deal with this issue?
The answer is to make everything unicode:

You could provide the Unicode prefix when formatting:

This does deal with the issue, but you are littering your Python 3 script with unnecessary u'' prefixes.

You could also add from __future__ import unicode_literals (it has to appear at the top of the module, so it cannot be made conditional on the version, but it is harmless in Python 3). I wouldn't do that, though: it is generally trickier to work with and has been considered for deprecation, since the u'' prefix does the job sufficiently.

In Python 2 you should use unicode strings for join and print.
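As a minimal sketch of the u-prefix approach, assuming the people list from the question: both the separator passed to join and the format string are unicode literals. The name message is introduced here only for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: intended to run unchanged on Python 2.7 and on Python 3.3+
# (where the u prefix is accepted again).
people = [u'Nicholas Gyeney', u'Andr\xe9']

# A unicode separator and a unicode format string keep every intermediate
# value as text, so Python 2 never implicitly encodes with the ascii codec.
writers = u", ".join(people)
message = u"Writers: {}".format(writers)  # "message" is an illustrative name

print(writers)
print(message)
```

On Python 3 the u prefixes are redundant, which is the "littering" trade-off mentioned above.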
First, we assume that you want to support Python 2.7 and 3.5 (versions 2.6 and 3.0 to 3.2 are handled a bit differently).
As you have already read, setdefaultencoding is discouraged and actually not needed in your case.

To write cross-platform code dealing with unicode text, you generally only need to specify the string encoding in a few places:

- At the top of the script, with # -*- coding: utf-8 -*- (only if you have string literals with unicode text in your code)

Here is how I changed your example by following those rules:
which outputs:
Here is what changed:
- Replaced \xe9 with the actual Unicode character (é)
- Removed the u prefixes

It works just nicely in Python 2.7.12 and 3.5.2.
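A sketch of what the changed script could look like, reconstructed from the changes listed above rather than copied from the answer: the file declares its encoding, é is written literally, and no u prefixes are used. The last two print calls are an assumption based on the references to print(type(writers)) and len in the warning that follows.

```python
# -*- coding: utf-8 -*-
# Reconstruction under assumptions, not the answer's original listing.
people = ['Nicholas Gyeney', 'André']
writers = ', '.join(people)

print(writers)
print("Writers: {}".format(writers))

# On Python 2 these report the byte-string type and a byte count;
# on Python 3 they report str and the number of characters.
print(type(writers))
print(len(writers))
```

Under Python 3 the length printed is 22; the value 23 discussed next is what Python 2 reports for the UTF-8 encoded str.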
But be warned that removing the u prefixes makes Python 2 use the regular str type instead of unicode (see the output of print(type(writers))). With UTF-8 it works in most places as if it were a unicode string, but checking the text length returns a wrong value. In this example len returns 23, where the actual number of characters is 22. This is because the underlying type is str, which counts each byte as a character, and the character é takes two bytes in UTF-8.

In other words, this works fine when outputting data (as in your example), but not if you want to do string manipulation on the text. In that case you still need to use the u prefix, or convert the data to the unicode type explicitly before string manipulation.

So, if it were not for your simple example, it would be better to still use the u prefix. You need that in two places:
which outputs:
Note: the u prefix was removed in Python 3.0 and then reintroduced in Python 3.3 for backward compatibility.

A detailed explanation of all the intricacies of working with unicode text in Python 2 is available in the official documentation: Python 2 - Unicode HOWTO.
Here is an excerpt for the special comment specifying file encoding:
If you can get hold of the book "Learning Python, 5th Edition", I encourage you to read Chapter 37, "Unicode and Byte Strings", in Part VIII, Advanced Topics. It contains a detailed explanation of working with Unicode text in both generations of Python.
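Returning to the earlier point about converting data to unicode explicitly before string manipulation, here is a minimal sketch. The helper name to_text and its utf-8 default are assumptions made for illustration; they are not taken from any answer above.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding='utf-8'):
    """Return value as a text string, decoding it first if it is bytes.

    Hypothetical helper: on Python 2, bytes is an alias of str, so regular
    str objects get decoded; on Python 3 only bytes objects are decoded.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = u'Nicholas Gyeney, Andr\xe9'.encode('utf-8')  # UTF-8 encoded bytes
text = to_text(raw)

byte_count = len(raw)    # counts bytes: é occupies two of them
char_count = len(text)   # counts characters
print("%d bytes, %d characters" % (byte_count, char_count))
```

Decoding once at the boundary and working with text afterwards keeps len, slicing and similar operations character-based on both Python versions.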
Another detail worth mentioning is that, in Python 2, format always returns a byte string (str) if the format string was a byte string, even when the arguments are unicode. The arguments are then encoded with the ascii codec, which is where the UnicodeEncodeError comes from.

Contrary to that, old-style formatting with % returns a unicode string if any of the arguments are unicode. So instead of writing this:
you could write this, which is not only shorter and prettier, but works in both Python 2 and 3: