Python doesn't seem to be working with Arabic letters here in the code below. Any ideas?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import nltk
sentence = "ورود ممنوع"
tokens = nltk.word_tokenize(sentence)
print tokens
The result is:
>>>
['\xd9\x88\xd8\xb1\xd9\x88\xd8\xaf', '\xd9\x85\xd9\x85\xd9\x86\xd9\x88\xd8\xb9']
>>>
I also tried adding a u before the string, but it didn't help:
>>> u"ورود ممنوع"
>>>
['\xd9\x88\xd8\xb1\xd9\x88\xd8\xaf', '\xd9\x85\xd9\x85\xd9\x86\xd9\x88\xd8\xb9']
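Your results are correct: the list contains byte strings (the UTF-8 encoding of each word), and printing a list shows the escaped repr of each element rather than the text itself.
To convert the items to unicode, decode each one in a list comprehension.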
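The conversion described above might look like the following minimal sketch. It assumes the tokens are UTF-8 encoded byte strings, as in the question's output; the names `decoded` and `joined` are illustrative and not from the original post. The `b` prefix keeps the snippet valid on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# The byte strings from the question's output (assumed to be UTF-8).
tokens = [b'\xd9\x88\xd8\xb1\xd9\x88\xd8\xaf',
          b'\xd9\x85\xd9\x85\xd9\x86\xd9\x88\xd8\xb9']

# Decode every byte string into a unicode string with a list comprehension.
decoded = [token.decode('utf-8') for token in tokens]

# Printing a list shows the repr of each item, so join the items
# (or loop over them) to get a single readable unicode string.
joined = u' '.join(decoded)
```

Printing `joined`, or each item of `decoded` individually, should display the Arabic text on a terminal that supports UTF-8, whereas printing the list itself still shows escape sequences on Python 2.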
See also: Printing Unicode Char inside a List