I am running the following code in IDLE (Python). I want to enter an Arabic string and get its stem, but it doesn't work:
>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> w = 'حركات'
>>> join = w.decode('Windows-1256')
>>> print st.stem(join).encode('Windows-1256').decode('utf-8')
The result is the same text as in w, 'حركات', which is not the stem.
But when I do the following:
>>> print st.stem(u'اعلاميون')
it succeeds and returns the stem, which is 'علم'.
Why doesn't passing a variable to the stem() function return the stem?
There is a new light Arabic stemmer here, developed with the Snowball framework.
The code above won't work in Python 3 because it tries to decode an object that is already decoded: in Python 3 every str is already Unicode text. So there is no need to decode from UTF-8 anymore.
Here is the new code that should work just fine in Python 3.
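A minimal sketch of what such Python 3 usage might look like, assuming NLTK is installed. The `to_text` helper is hypothetical (it is not part of NLTK) and is only there to show that decoding is needed for raw bytes, never for a `str`:

```python
# Hypothetical helper (not part of NLTK): return text as str,
# decoding only when the input is still raw bytes.
def to_text(value, encoding="utf-8"):
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

word = "حركات"                          # already a Unicode str in Python 3
raw = word.encode("windows-1256")       # bytes, e.g. as read from a legacy file

# str has no .decode() in Python 3, which is why the old snippet fails there.
print(hasattr(word, "decode"))                # False
print(to_text(raw, "windows-1256") == word)   # True

# NLTK is a third-party package; the import is guarded so the encoding
# part of this sketch still runs without it.
try:
    from nltk.stem.isri import ISRIStemmer
except ImportError:
    ISRIStemmer = None

if ISRIStemmer is not None:
    st = ISRIStemmer()
    print(st.stem(to_text(word)))       # pass the str directly, no decode step
```

The point of the sketch is that the word goes to `stem()` as it is; the encode/decode chain from the question disappears entirely.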
Well, just notice that your two strings differ only by a mere "u" at the beginning of the second string.
But that tiny "u" makes all the difference: w is a UTF-8 encoded byte string (plain string literals are byte strings in Python 2), while w2 is a Unicode string.
Hence all you really need to do is make sure your string is defined as a Unicode string, and then you can use the stem function normally without any extra decoding step.
OK, I solved the problem myself using the following:
and it gives the root correctly, which is "حرك".
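To illustrate the byte string versus Unicode string distinction discussed above, here is a small sketch written for Python 3, where the two types are `bytes` and `str`. It assumes the word was entered as UTF-8 bytes, as one of the answers suggests. Under that assumption it shows one plausible reason the question's code printed the input unchanged: decoding with the wrong codec hands the stemmer a different string, and the encode/decode chain afterwards simply undoes the damage.

```python
word = "حركات"                      # Unicode text: 5 characters
utf8_bytes = word.encode("utf-8")   # byte string: 2 bytes per letter for this word

print(len(word), len(utf8_bytes))   # 5 10

# Decoding UTF-8 bytes with the wrong codec yields other characters,
# so the stemmer is handed a different string than the one intended.
mojibake = utf8_bytes.decode("windows-1256")
print(len(mojibake), mojibake == word)   # 10 False

# Reversing both steps restores the original word unchanged.
round_trip = mojibake.encode("windows-1256").decode("utf-8")
print(round_trip == word)           # True
```

The variable names here are illustrative only. The sketch does not call the stemmer, so it says nothing about what `stem()` would return for the mis-decoded string.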