How do I do word Stemming or Lemmatization?

2019-01-03 19:51发布

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

See also:

21条回答
神经病院院长
2楼-- · 2019-01-03 20:09

Take a look at LemmaGen - open source library written in C# 3.0.

Results for your test words (http://lemmatise.ijs.si/Services)

  • cats -> cat
  • running
  • ran -> run
  • cactus
  • cactuses -> cactus
  • cacti -> cactus
  • community
  • communities -> community
查看更多
可以哭但决不认输i
3楼-- · 2019-01-03 20:10

If I may quote my answer to the question StompChicken mentioned:

The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.

As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".

If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.

查看更多
我想做一个坏孩纸
4楼-- · 2019-01-03 20:11

http://wordnet.princeton.edu/man/morph.3WN

For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

查看更多
淡お忘
5楼-- · 2019-01-03 20:12

In Java, i use tartargus-snowball to stemming words

Maven:

<dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-snowball</artifactId>
        <version>3.0.3</version>
        <scope>test</scope>
</dependency>

Sample code:

SnowballProgram stemmer = new EnglishStemmer();
String[] words = new String[]{
    "testing",
    "skincare",
    "eyecare",
    "eye",
    "worked",
    "read"
};
for (String word : words) {
    stemmer.setCurrent(word);
    stemmer.stem();
    //debug
    logger.info("Origin: " + word + " > " + stemmer.getCurrent());// result: test, skincar, eyecar, eye, work, read
}
查看更多
对你真心纯属浪费
6楼-- · 2019-01-03 20:14

I tried your list of terms on this snowball demo site and the results look okay....

  • cats -> cat
  • running -> run
  • ran -> ran
  • cactus -> cactus
  • cactuses -> cactus
  • community -> communiti
  • communities -> communiti

A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

查看更多
迷人小祖宗
7楼-- · 2019-01-03 20:15
登录 后发表回答