As it is summer now, I decided to learn a new language, and Python was my choice. What I would really like to learn is how to manipulate Arabic text using Python. I have found many resources on using Python, which are really great. However, when I apply what I have learned to Arabic strings, I get a mix of numbers and letters instead.
Take, for example, this English string:
>>> ebook = 'The American English Dictionary'
>>> ebook[2]
'e'
Now, for Arabic:
>>> abook = 'القاموس العربي'
>>> abook[2]
'\xde' #the correct output should be 'ق'
However, using print works fine, as in:
>>> print abook[2]
ق
What do I need to modify to get Python to always recognize Arabic letters?
Use Unicode explicitly:
>>> s = u'القاموس العربي'
>>> s
u'\u0627\u0644\u0642\u0627\u0645\u0648\u0633 \u0627\u0644\u0639\u0631\u0628\u064a'
>>> print s
القاموس العربي
>>> print s[2]
ق
Or even character by character:
>>> for i, c in enumerate(s):
...     print i, c
...
0 ا
1 ل
2 ق
3 ا
4 م
5 و
6 س
7
8 ا
9 ل
10 ع
11 ر
12 ب
13 ي
14
I recommend the official Python Unicode HOWTO, which is short, practical, and useful.
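The same checks work out of the box on Python 3, where every str is a sequence of code points; a minimal sketch (my addition, not part of the original answer):

```python
# Python 3: str is already Unicode, so indexing yields characters.
s = 'القاموس العربي'

assert s[2] == 'ق'          # the third character, as expected
assert s[2] == '\u0642'     # 'ق' is code point U+0642
assert len(s) == 14         # 14 code points, counting the space
```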
Use Python 3.x: strings are now Unicode by default; see What's New in Python 3.
>>> abook = 'القاموس العربي'
>>> abook[0]
'ا'
>>> abook[4]
'م'
If you want the input:
>>> abook[2]
to produce the following output:
'ق'
it'll never happen. The interactive shell prints repr(abook[2]), which will always use escape sequences for Arabic characters. I don't know the exact rules, but I'd guess that most characters outside the ASCII range are escaped. To make it work as advertised, use the u prefix; the output will still be an escape sequence (albeit the correct one, this time):
>>> abook = u'القاموس العربي'
>>> abook[2]
u'\u0642'
The reason you get '\xde' is that without the u prefix, abook holds the UTF-8 encoding of the phrase. My output differs from yours (possibly because the code points were altered through copy-pasting; I'm not sure), but the principle still holds:
>>> abook = 'القاموس العربي'
>>> ' '.join( hex(ord(c))[-2:] for c in abook )
'd8 a7 d9 84 d9 82 d8 a7 d9 85 d9 88 d8 b3 20 d8 a7 d9 84 d8 b9 d8 b1 d8 a8 d9 8a'
>>> abook[2]
'\xd9'
You can confirm this as follows:
>>> abook = 'القاموس العربي'
>>> unicode(abook, 'utf-8')[2]
u'\u0642'
>>> print unicode(abook, 'utf-8')[2]
ق
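On Python 3, unicode() is gone; the equivalent is decoding the raw bytes with bytes.decode. A sketch of the same confirmation (assuming UTF-8, as above):

```python
# Start from the raw UTF-8 bytes of the phrase.
raw = 'القاموس العربي'.encode('utf-8')

# Same byte sequence as the hex dump above: ا ل ق take two bytes each.
assert ' '.join(f'{b:02x}' for b in raw[:6]) == 'd8 a7 d9 84 d9 82'
assert raw[2] == 0xd9                        # matches the '\xd9' seen earlier
assert raw.decode('utf-8')[2] == '\u0642'    # decoding restores 'ق'
```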
Going by the result in the comments on the question, this looks like repr is causing a mojibake issue - that is, it is getting confused about encodings and using the wrong one. print will try to use the encoding it thinks your STDOUT uses and write the resulting bytes directly; repr tries to print an ASCII-safe representation, although it seems to be failing badly in this situation.
The good news is that this is an issue with repr, not with Python's Unicode handling. As long as the round trip s.encode('utf8').decode('utf8') == s works, you're fine. print the value when you want to inspect it, rather than just mentioning it at the interactive terminal; use Unicode strings everywhere (using Py3 will help massively with this, or at minimum do:
from __future__ import unicode_literals
from io import open
); keep track of encodings, and your program will work even if repr happens to do something bizarre.
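That round trip is easy to check; a minimal sketch in Python 3, where str is always Unicode:

```python
s = 'القاموس العربي'

# If encoding to UTF-8 and decoding back returns the original string,
# the text itself is intact; any odd display is purely a repr issue.
assert s.encode('utf8').decode('utf8') == s
```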
Also note that your question is not about UTF-8 in any way - it's about Unicode, which is a different (though related) concept. If the resources you've been reading haven't emphasized this difference, find better resources - a misunderstanding of these concepts will cause you a lot of pain.