Python Latin Characters and Unicode

I have a tree structure in which keywords may contain non-ASCII Latin characters such as é. A function loops through all leaves of the tree and, under certain conditions, adds each keyword to a list.

Here is the code I have for adding these keywords to the list:

print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list

If the keyword in this case is université, then my output is:

Adding: université
['universit\xc3\xa9']

It appears that print shows the Latin character properly, but once the keyword is in the list it is displayed as escape codes.

How can I change this? I need to be able to print the list with the standard Latin characters, not the escaped version of them.

Tags: python python-2.7 unicode latin1 python-unicode

2 Answers

小情绪 Triste *

Post #2 · 2020-02-15 05:39

When you print a list, you get the repr of the items it contains, which for strings is different from their contents:

>>> a = ['foo', 'bär']
>>> print(a[0])
foo
>>> print(repr(a[0]))
'foo'
>>> print(a[1])
bär
>>> print(repr(a[1]))
'b\xc3\xa4r'

The output of repr is meant to be programmer-friendly, not user-friendly, hence the quotes and the hex escapes. To print a list in a user-friendly way, build the output string yourself, for example with str.join:

>>> print '[', ', '.join(a), ']'
[ foo, bär ]
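
The same idea can be wrapped in a small helper. This is only a minimal sketch: format_keywords is a hypothetical name, and it assumes the keywords are UTF-8 encoded byte strings, as in the question. It decodes each item and joins the results, so the accented characters are kept instead of being shown as \xhh escapes.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # same code runs on Python 2.7 and 3


def format_keywords(keywords, encoding='utf-8'):
    """Return a readable, list-like rendering of the keywords as Unicode text.

    Hypothetical helper; assumes any byte strings are encoded with `encoding`.
    """
    decoded = [
        k.decode(encoding) if isinstance(k, bytes) else k
        for k in keywords
    ]
    return u'[' + u', '.join(decoded) + u']'


leaf_list = [b'universit\xc3\xa9', b'foo']  # UTF-8 bytes, as in the question
print(format_keywords(leaf_list))
```

Items that are already Unicode objects are passed through unchanged, so the helper also works on mixed lists.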


迷人小祖宗

Post #3 · 2020-02-15 06:03

You don't have unicode objects, but byte strings with UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.

When a list is converted to a string, its contents are shown as representations, that is, as the result of the repr() function. The representation of a string object uses escape codes for any bytes outside the printable ASCII range; newlines are replaced by \n, for example. Your UTF-8 bytes are represented by \xhh escape sequences.

If you were using Unicode objects, the representation would still use \xhh escapes, but only for codepoints in the Latin-1 range outside ASCII; all other codepoints are shown with \uhhhh or \Uhhhhhhhh escapes, depending on their value. When printing, Python automatically encodes such values to the correct encoding for your terminal:

>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
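
The three escape forms can also be checked without depending on how a particular interpreter displays repr output. The sketch below uses the standard unicode_escape codec as a stand-in for what repr shows for Unicode objects in Python 2; the sample strings are illustrative and were chosen so that one falls in each codepoint range.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # same code runs on Python 2.7 and 3

samples = [
    u'universit\xe9',  # U+00E9: Latin-1 range, outside ASCII
    u'\u20ac',         # U+20AC: EURO SIGN, within the Basic Multilingual Plane
    u'\U0001f600',     # U+1F600: beyond the Basic Multilingual Plane
]

# unicode_escape produces ASCII bytes; decode them back to text for printing.
escaped = [s.encode('unicode_escape').decode('ascii') for s in samples]

for line in escaped:
    print(line)
```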

Compare this to byte strings:

>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université

Note that the length of the byte string also reflects that the é codepoint is encoded to two bytes in UTF-8. By the way, it was my terminal that presented Python with the \xc3\xa9 bytes when I pasted the é character into the Python session, as it is configured to use UTF-8. Python detected this and decoded the bytes when I defined a u'..' Unicode object literal.
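
One way to apply this to the code in the question is to decode each keyword once, when it enters the program, and to work with Unicode objects from then on. The following is a minimal sketch that assumes the input is UTF-8; the variable names are illustrative and not taken from the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # same code runs on Python 2.7 and 3

raw_keyword = b'universit\xc3\xa9'     # bytes as read from a UTF-8 source
keyword = raw_keyword.decode('utf-8')  # Unicode text

byte_length = len(raw_keyword)  # counts bytes
text_length = len(keyword)      # counts codepoints

# Encoding again with the same codec gives back the original bytes.
round_trip = keyword.encode('utf-8')

print(byte_length, text_length)
```

The two lengths differ by one because é takes two bytes in UTF-8 but is a single codepoint.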

I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

