The following code works in Python 3:
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
And produces the following output:
Nicholas Gyeney, André
Writers: Nicholas Gyeney, André
In Python 2.7, though, I get the following error:
Traceback (most recent call last):
File "python", line 4, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'
in position 21: ordinal not in range(128)
I can fix this error by changing ", ".join(people) to ", ".join(people).encode('utf-8'), but if I do so, the output in Python 3 changes to:
b'Nicholas Gyeney, Andr\xc3\xa9'
Writers: b'Nicholas Gyeney, Andr\xc3\xa9'
So I tried to use the following code:
import sys

if sys.version_info < (3, 0):
    reload(sys)
    sys.setdefaultencoding('utf-8')
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
This makes my code work in all versions of Python, but I read that using setdefaultencoding is discouraged.
What's the best approach to deal with this issue?
First, I assume that you want to support Python 2.7 and 3.5 (2.6 and 3.0 to 3.2 are handled a bit differently).
As you have already read, setdefaultencoding is discouraged, and it is actually not needed in your case.
To write cross-version code that deals with Unicode text, you generally only need to specify the string encoding in a few places:
- At the top of your script, below the shebang, with # -*- coding: utf-8 -*- (only if you have string literals with Unicode text in your code)
- When you read input data (e.g. from a text file or database)
- When you output data (again, to a text file or database)
- When you define a string literal in code
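As a minimal sketch of the input and output items above, the following writes and reads the text through io.open with an explicit encoding, which is available in both Python 2.7 and 3. The temporary directory and the file name writers.txt are assumptions made up for this illustration; they are not part of your code.

```python
import io
import os
import shutil
import tempfile

people = [u'Nicholas Gyeney', u'Andr\xe9']

# Hypothetical location, created only for this illustration.
directory = tempfile.mkdtemp()
path = os.path.join(directory, 'writers.txt')

# Output: the text is encoded to UTF-8 bytes as it is written.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u", ".join(people))

# Input: the bytes are decoded back to text as they are read.
with io.open(path, 'r', encoding='utf-8') as handle:
    writers = handle.read()

# The file on disk holds one byte more than the text has characters.
size_on_disk = os.path.getsize(path)

# Remove the temporary directory again.
shutil.rmtree(directory)

print(len(writers))
print(size_on_disk)
```

Inside the program the data stays text the whole time; the encoding is only named at the two points where it crosses the file boundary.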
Here is how I changed your example by following those rules:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
people = ['Nicholas Gyeney', 'André']
writers = ", ".join(people)
print(writers)
print("Writers: {}".format(writers))
print(type(writers))
print(len(writers))
In Python 2.7, the last two lines of the output are:
<type 'str'>
23
Here is what changed:
- Specified the file encoding at the top of the file
- Replaced \xe9 with the actual Unicode character (é)
- Removed the u prefixes
It works nicely in Python 2.7.12 and 3.5.2.
But be warned that removing the u prefixes makes Python 2 use the regular str type instead of unicode (see the output of print(type(writers))). With UTF-8 it works in most places as if it were a unicode string, but checking the text length returns the wrong value. In this example len returns 23, whereas the actual number of characters is 22. This is because the underlying type is str, which counts each byte as a character, and the character é is encoded as two bytes in UTF-8.
In other words, this works fine when outputting data (as in your example), but not if you want to do string manipulation on the text. In that case, you still need to use the u prefix or convert the data to the unicode type explicitly before manipulating it.
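To make the length difference concrete, here is a small sketch that runs unchanged in Python 2.7 and 3. It starts from a unicode literal and encodes it by hand, which reproduces the bytes that a plain str literal holds in a UTF-8 source file under Python 2. The variable names are my own, chosen for illustration.

```python
text = u'Nicholas Gyeney, Andr\xe9'

# Encoding produces the UTF-8 bytes that a Python 2 str literal would hold.
encoded = text.encode('utf-8')

print(len(text))     # 22 characters
print(len(encoded))  # 23 bytes, because u'\xe9' takes two bytes in UTF-8

# Decoding first restores the character count before any manipulation.
decoded = encoded.decode('utf-8')
print(len(decoded))  # 22
```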
So, for anything beyond your simple example, it is better to keep using the u prefix. You need it in two places:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
people = [u'Nicholas Gyeney', u'André']
writers = ", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))
print(type(writers))
print(len(writers))
In Python 2.7, the last two lines of the output are:
<type 'unicode'>
22
Note: the u prefix was removed in Python 3.0 and reintroduced in Python 3.3 for backward compatibility.
A detailed explanation of all the intricacies of working with Unicode text in Python 2 is available in the official documentation: Python 2 - Unicode HOWTO.
Here is an excerpt about the special comment that specifies the file encoding:
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source file:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = u'abcdé'
print ord(u[-1])
The syntax is inspired by Emacs’s notation for specifying variables
local to a file. Emacs supports many different variables, but Python
only supports coding. The -*- symbols indicate to Emacs that the
comment is special; they have no significance to Python but are a
convention. Python looks for coding: name or coding=name in the
comment.
If you don’t include such a comment, the default encoding used will be
ASCII.
If you can get hold of the book "Learning Python, 5th Edition", I encourage you to read Chapter 37, "Unicode and Byte Strings", in Part VIII, Advanced Topics. It contains a detailed explanation of working with Unicode text in both generations of Python.
Another detail worth mentioning is that, in Python 2, format always returns a byte string (str) if the format string is a str, even when the arguments are unicode. A non-ASCII argument then has to be encoded with the ASCII codec, which is what raises the UnicodeEncodeError in your example.
In contrast, old-style formatting with % returns a unicode string if any of the arguments are unicode. So instead of writing this
print(u"Writers: {}".format(writers))
you could write this, which is not only shorter and prettier, but works in both Python 2 and 3:
print("Writers: %s" % writers)
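As a quick check of that behaviour, the sketch below formats with % and tests whether the result is text. The helper name text_type is my own; it resolves to unicode in Python 2 and to str in Python 3, so the same check runs in both.

```python
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = u", ".join(people)

# The format string has no u prefix, yet the result is still text.
line = "Writers: %s" % writers

# type(u"") is unicode in Python 2 and str in Python 3.
text_type = type(u"")
print(isinstance(line, text_type))
```

This prints True in both versions, because in Python 2 the unicode argument promotes the whole result to unicode.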
You could provide the Unicode prefix when formatting:
print(u"Writers: {}".format(writers))
This does deal with the issue, but you are littering your Python 3 script with unnecessary u'' prefixes.
You could also put from __future__ import unicode_literals at the top of the file (a __future__ import cannot be placed behind a version check, and it is harmless in Python 3), but I wouldn't do that: it is generally trickier to work with and has been considered for deprecation, since the u'' prefix does the job sufficiently.
In Python 2 you should use unicode strings for join and print:
people = [u'Nicholas Gyeney', u'Andr\xe9']
writers = u", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))
The answer is to make everything unicode:
# -*- coding: utf-8 -*-
people = [u'Nicholas Gyeney', u'André']
writers = u", ".join(people)
print(writers)
print(u"Writers: {}".format(writers))