Python ISRIStemmer for Arabic text

I am running the following code in IDLE (Python). I want to enter an Arabic string and get its stem, but it doesn't work:

>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> w = 'حركات'
>>> join = w.decode('Windows-1256')
>>> print st.stem(join).encode('Windows-1256').decode('utf-8')

The result is the same text as in w, 'حركات', which is not the stem.

But when I do the following:

>>> print st.stem(u'اعلاميون')

it succeeds and returns the stem, 'علم'.

Why doesn't passing a variable to stem() return the stem?

Tags: arabic stemming utf8-decode

4 Answers

神经病院院长

Reply #2 · 2019-02-07 15:44

There is a new light Arabic stemmer here, developed with the Snowball framework.

Fickle 薄情

Reply #3 · 2019-02-07 15:54

The code above won't work in Python 3, because it tries to decode an object that is already decoded: a Python 3 str is Unicode text and has no decode() method. So there is no need to decode from UTF-8 anymore.

Here is the new code, which should work in Python 3:

from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()
w = 'حركات'  # a Python 3 str is already Unicode, so no decode() is needed
print(st.stem(w))
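To make the "already decoded" point concrete, here is a minimal Python 3 sketch using only the standard library. The variable names are illustrative, and NLTK is deliberately left out so the snippet runs anywhere:

```python
# Python 3: a plain string literal is already Unicode text.
word = 'حركات'

# str has no decode() method in Python 3; only bytes objects do.
has_decode = hasattr(word, 'decode')

# Bytes (for example, read from a file opened in binary mode)
# still need exactly one decode step to become text.
raw = word.encode('utf-8')
decoded = raw.decode('utf-8')

print(type(word).__name__, has_decode)
print(type(raw).__name__, decoded == word)
```

So in Python 3 the only time a decode step is needed is when the input actually arrives as bytes.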


手持菜刀，她持情操

Reply #4 · 2019-02-07 16:03

Notice that your two strings differ only by the "u" prefix on the second one:

w = 'حركات'
w2 = u'اعلاميون'

But that "u" makes all the difference: in Python 2, w is a byte string (a str holding the encoded bytes of the text, in whatever encoding the shell or source file uses, often UTF-8), while w2 is a Unicode string.

Hence all you need to do is define your string as a Unicode string; then you can call stem() normally, without any extra decoding step:

w = u'حركات'
print st.stem(w)
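The difference between bytes and text can be sketched in Python 3 with the standard library alone. This also suggests one plausible reading of why the original attempt printed the word unchanged, under the assumption (not confirmed in the question) that the shell supplied UTF-8 bytes that were then decoded as Windows-1256:

```python
word = 'حركات'                       # Unicode text
utf8_bytes = word.encode('utf-8')    # roughly what a Python 2 str would hold in a UTF-8 shell

# Assumption: the bytes are UTF-8 but get decoded with the wrong codec.
# The result is mojibake, not Arabic letters a stemmer could match.
mojibake = utf8_bytes.decode('windows-1256')

# Undoing the same wrong step and then decoding as UTF-8 restores the
# original word, which would hide the fact that nothing was stemmed.
round_trip = mojibake.encode('windows-1256').decode('utf-8')

print(len(word), len(utf8_bytes), len(mojibake))
print(mojibake == word, round_trip == word)
```

If that assumption holds, the stemmer received unrecognizable characters, left them alone, and the final encode/decode pair turned them back into the original word.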


狗以群分

Reply #5 · 2019-02-07 16:08

OK, I solved the problem myself with the following:

w = 'حركات'
st.stem(w.decode('utf-8'))

It correctly returns the root, "حرك".
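For code that may receive either bytes or text, a small normalizing helper avoids decoding twice. This is only a sketch: to_text is a hypothetical name, not part of NLTK, and UTF-8 is an assumed default that should match wherever the bytes really come from.

```python
def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding only when it is bytes.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = 'حركات'.encode('utf-8')
print(to_text(raw) == to_text('حركات'))

# With NLTK installed, the normalized text can be passed to the stemmer:
#   from nltk.stem.isri import ISRIStemmer
#   ISRIStemmer().stem(to_text(raw))
```

The stemmer call is left as a comment so the snippet runs without NLTK.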
