How do I do word Stemming or Lemmatization?-第2页回答

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

See also:

标签： nlp stemming lemmatization

21条回答

神经病院院长

2楼-- · 2019-01-03 20:09

Take a look at LemmaGen - open source library written in C# 3.0.

Results for your test words (http://lemmatise.ijs.si/Services)

cats -> cat
running
ran -> run
cactus
cactuses -> cactus
cacti -> cactus
community
communities -> community

0人赞添加讨论(0) 举报

可以哭但决不认输i

3楼-- · 2019-01-03 20:10

If I may quote my answer to the question StompChicken mentioned:

The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.

As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".

If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.

0人赞添加讨论(0) 举报

我想做一个坏孩纸

4楼-- · 2019-01-03 20:11

http://wordnet.princeton.edu/man/morph.3WN

For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

0人赞添加讨论(0) 举报

淡お忘

5楼-- · 2019-01-03 20:12

In Java, i use tartargus-snowball to stemming words

Maven:

<dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-snowball</artifactId>
        <version>3.0.3</version>
        <scope>test</scope>
</dependency>

Sample code:

SnowballProgram stemmer = new EnglishStemmer();
String[] words = new String[]{
    "testing",
    "skincare",
    "eyecare",
    "eye",
    "worked",
    "read"
};
for (String word : words) {
    stemmer.setCurrent(word);
    stemmer.stem();
    //debug
    logger.info("Origin: " + word + " > " + stemmer.getCurrent());// result: test, skincar, eyecar, eye, work, read
}

0人赞添加讨论(0) 举报

对你真心纯属浪费

6楼-- · 2019-01-03 20:14

I tried your list of terms on this snowball demo site and the results look okay....

cats -> cat
running -> run
ran -> ran
cactus -> cactus
cactuses -> cactus
community -> communiti
communities -> communiti

A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

0人赞添加讨论(0) 举报

迷人小祖宗

7楼-- · 2019-01-03 20:15

This looks interesting: MIT Java WordnetStemmer: http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html

0人赞添加讨论(0) 举报

How do I do word Stemming or Lemmatization?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间