Working with UTF-8 in Python

Posted 2019-04-12 04:14

Question:

As it is summer now, I decided to learn a new language, and Python was my choice. What I would really like to learn is how to manipulate Arabic text using Python. I have found many great resources on using Python. However, when I apply what I learned to Arabic strings, I get numbers and letters combined together.

Take for example this for English:

>>> ebook = 'The American English Dictionary'
>>> ebook[2]
'e'

Now, for Arabic:

>>> abook = 'القاموس العربي'
>>> abook[2]
'\xde'                  #the correct output should be 'ق'

However, using print works fine, as in:

>>> print abook[2]
ق

What do I need to modify to get Python to always recognize Arabic letters?

Answer 1:

Use Unicode explicitly:

>>> s = u'القاموس العربي'
>>> s
u'\u0627\u0644\u0642\u0627\u0645\u0648\u0633 \u0627\u0644\u0639\u0631\u0628\u064a'
>>> print s
القاموس العربي

>>> print s[2]
ق

Or even character by character:

>>> for i, c in enumerate(s):
...     print i,c
... 
0 ا
1 ل
2 ق
3 ا
4 م
5 و
6 س
7  
8 ا
9 ل
10 ع
11 ر
12 ب
13 ي

I recommend the Python Unicode page which is short, practical and useful.
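
To see which code point sits behind each character, a loop like the one above can also print the code point and its Unicode name. This is a minimal sketch that assumes Python 3 (where every string literal is already Unicode). The helper name describe is made up for illustration.

```python
# Python 3 sketch: list each character with its code point and Unicode name.
import unicodedata

s = 'القاموس العربي'


def describe(text):
    """Return (index, character, code point, name) for every character.

    `describe` is a hypothetical helper, not part of any library.
    """
    return [
        (i, c, 'U+%04X' % ord(c), unicodedata.name(c))
        for i, c in enumerate(text)
    ]


rows = describe(s)
for row in rows:
    print(*row)
```

Index 2 should come out as U+0642, ARABIC LETTER QAF, matching the \u0642 escape in the repr shown earlier.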



Answer 2:

Use Python 3.x: strings are now Unicode by default. See "What's New in Python 3.0".

>>> abook = 'القاموس العربي'
>>> abook[0]
'ا'
>>> abook[4]
'م'
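
The Python 3 shell still displays repr() of the value, but in Python 3 repr() leaves printable non-ASCII characters unescaped, which is why the session above shows the letters themselves. The built-in ascii() gives the escaped form when you want it. A small sketch, assuming Python 3:

```python
# Python 3 sketch: repr() keeps printable non-ASCII characters, ascii() escapes them.
abook = 'القاموس العربي'
ch = abook[2]

shown_by_repr = repr(ch)    # what the Python 3 interactive shell displays
shown_by_ascii = ascii(ch)  # ASCII-only form, with a \u escape

print(shown_by_repr)
print(shown_by_ascii)
```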


Answer 3:

If you want the input:

>>> abook[2]

to produce the following output:

'ق'

it'll never happen in Python 2. The interactive shell prints repr(abook[2]), which will always use escape sequences for Arabic characters. As far as I know, Python 2's repr escapes every character outside printable ASCII. To make indexing work as expected, use the u prefix, but the shell will still output an escape sequence (albeit the correct one, this time):

>>> abook = u'القاموس العربي'
>>> abook[2]
u'\u0642'

The reason you get '\xde' is that without the u prefix, abook holds encoded bytes rather than Unicode characters, in whatever encoding your terminal uses. On my terminal that encoding is UTF-8, so my output differs from yours (a '\xde' at index 2 would be consistent with a Windows-1256 terminal, though I'm not sure), but the principle still holds:

>>> abook = 'القاموس العربي'
>>> ' '.join( hex(ord(c))[-2:] for c in abook )
'd8 a7 d9 84 d9 82 d8 a7 d9 85 d9 88 d8 b3 20 d8 a7 d9 84 d8 b9 d8 b1 d8 a8 d9 8a'
>>> abook[2]
'\xd9'

You can confirm this as follows:

>>> abook = 'القاموس العربي'
>>> unicode(abook, 'utf-8')[2]
u'\u0642'
>>> print unicode(abook, 'utf-8')[2]
ق
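
The same effect can be reproduced in Python 3 by encoding the text explicitly. The sketch below compares UTF-8 with Windows-1256 (cp1256). Choosing cp1256 is an assumption made to match the '\xde' in the question. The question itself does not say which encoding the asker's terminal used.

```python
# Python 3 sketch: the byte at index 2 depends on the encoding,
# while the character at index 2 does not.
text = 'القاموس العربي'

utf8_bytes = text.encode('utf-8')
cp1256_bytes = text.encode('cp1256')  # assumed encoding, see the note above

# Indexing bytes returns a single byte (an int in Python 3), not a character.
print(hex(utf8_bytes[2]))    # first byte of the two-byte sequence for the second letter
print(hex(cp1256_bytes[2]))  # the single byte that encodes the third letter

# Decoding with the matching codec recovers the same text either way.
decoded_utf8 = utf8_bytes.decode('utf-8')
decoded_cp1256 = cp1256_bytes.decode('cp1256')
print(decoded_utf8[2] == decoded_cp1256[2] == text[2])
```

In a single-byte encoding such as cp1256, one byte is one letter, which would also explain why print abook[2] displayed the right letter in the question.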


Answer 4:

Going by the result in the comments on the question, this looks like repr is causing a mojibake issue: it is getting confused about encodings and using the wrong one. print will try to use the encoding it thinks your STDOUT uses and write the resulting bytes directly, whereas repr tries to print an ASCII-safe representation, although it seems to be failing badly in this situation.

The good news is that this is an issue with repr, not with Python's Unicode handling. As long as the round trip s.encode('utf8').decode('utf8') == s works, you're fine. print the value when you want to inspect it rather than just mentioning it at the interactive terminal, and use Unicode strings everywhere (using Py3 will help massively with this, or at minimum do:

from __future__ import unicode_literals
from io import open

), keep track of encodings, and your program will work even if repr happens to do something bizarre.
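
As a rough illustration of that advice, the sketch below checks the round trip and then writes and reads a file with an explicit encoding. It assumes Python 3, where io.open is the same function as the built-in open. The file name abook.txt is hypothetical.

```python
# Python 3 sketch: round-trip check plus file I/O with an explicit encoding.
import io
import os
import tempfile

s = 'القاموس العربي'

# The round trip from the text above: encode to bytes, decode back.
round_trip_ok = s.encode('utf8').decode('utf8') == s

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'abook.txt')  # hypothetical file name
    # Always name the encoding instead of relying on the platform default.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(s)
    with io.open(path, 'r', encoding='utf-8') as f:
        loaded = f.read()

print(round_trip_ok, loaded == s)
```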

Also note that your question is not about UTF-8 in any way. It's about Unicode, which is a different (though related) concept. If the resources you've been reading haven't emphasized this difference, get better resources, because a misunderstanding of these concepts will lead you to a lot of pain.
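
One way to see the difference: the Unicode string is a sequence of code points, and UTF-8 is only one of several ways to turn those code points into bytes. A short sketch, assuming Python 3 (the selection of codecs is just for illustration):

```python
# Python 3 sketch: one Unicode string, several byte encodings of it.
s = 'القاموس العربي'

code_points = len(s)  # number of Unicode code points in the string

# Number of bytes the same text occupies in each encoding.
sizes = {name: len(s.encode(name)) for name in ('utf-8', 'utf-16-le', 'utf-32-le')}

print(code_points)
for name, size in sizes.items():
    print(name, size)
```

The number of code points stays the same. Only the byte count changes with the encoding.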