How do I do word Stemming or Lemmatization?

Posted 2019-01-03 19:51

I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

21 Answers
再贱就再见
Answer #2 · 2019-01-03 20:15

Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There is an English stemmer for both C and Java.

He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.

From http://tartarus.org/~martin/PorterStemmer/index.html (emphasis mine)

The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which derives from it, and which is subjected to occasional improvements. For practical work, therefore, the new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable.

Dr. Porter suggests using the English or Porter2 stemmer instead of the Porter stemmer. The English stemmer is what's actually used on the demo site, as @StompChicken answered earlier.
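
To see the difference in practice, here is a minimal sketch (my addition, not part of the original answer) that runs the question's test words through both NLTK's Porter stemmer and its Snowball / Porter2 "english" stemmer:

# Compare the frozen Porter stemmer with the recommended Snowball (Porter2)
# "english" stemmer on the question's test words, using NLTK's implementations.
from nltk.stem import PorterStemmer, SnowballStemmer

words = "cats running ran cactus cactuses cacti community communities".split()
porter = PorterStemmer()
porter2 = SnowballStemmer("english")  # the Snowball / Porter2 English stemmer

for w in words:
    print(w, "->", porter.stem(w), "/", porter2.stem(w))

Keep in mind that both are still stemmers: they truncate word forms rather than look up dictionary lemmas, which is why the question's test words come out less than half "right".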

爷的心禁止访问
Answer #3 · 2019-01-03 20:16

I highly recommend using Spacy (base text parsing & tagging) and Textacy (higher level text processing built on top of Spacy).

Lemmatized words are available by default in Spacy as a token's .lemma_ attribute, and text can be lemmatized while doing a lot of other text preprocessing with Textacy, for example while creating a bag of terms or words, or just before performing some other processing that requires it.

I'd encourage you to check out both before writing any code, as this may save you a lot of time!
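
For reference, a minimal spaCy-only sketch of the .lemma_ attribute mentioned above (this assumes the English model is installed, e.g. via "spacy download en"; Textacy calls are omitted here):

# Lemmatize the question's test words via spaCy's token.lemma_ attribute.
import spacy

nlp = spacy.load('en')  # assumes the 'en' model / shortcut link is installed
doc = nlp("cats running ran cactus cactuses cacti community communities")
print([token.lemma_ for token in doc])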

兄弟一词,经得起流年.
Answer #4 · 2019-01-03 20:19

Try this one here: http://www.twinword.com/lemmatizer.php

I entered your query in the demo "cats running ran cactus cactuses cacti community communities" and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.

Sample Code

This is an API, so you can connect to it from any environment. Here is what the PHP REST call might look like.

// These code snippets use an open-source library. http://unirest.io/php
$response = Unirest\Request::post([ENDPOINT],
  array(
    "X-Mashape-Key" => [API KEY],
    "Content-Type" => "application/x-www-form-urlencoded",
    "Accept" => "application/json"
  ),
  array(
    "text" => "cats running ran cactus cactuses cacti community communities"
  )
);
时光不老,我们不散
Answer #5 · 2019-01-03 20:25

I use Stanford NLP to perform lemmatization. I was stuck on a similar problem over the last few days; thanks to StackOverflow for helping me solve it.

import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class example
{
    public static void main(String[] args)
    {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
        String text = "cats running ran cactus cactuses cacti community communities"; // replace with the string you want
        Annotation document = pipeline.process(text);

        for (CoreMap sentence : document.get(SentencesAnnotation.class))
        {
            for (CoreLabel token : sentence.get(TokensAnnotation.class))
            {
                String word = token.get(TextAnnotation.class);   // original surface form
                String lemma = token.get(LemmaAnnotation.class); // lemmatized form
                System.out.println("lemmatized version :" + lemma);
            }
        }
    }
}

It might also be a good idea to filter out stopwords to minimize the output lemmas if they are used later in a classifier (as sketched below). Please take a look at the coreNLP extension written by John Conwell.
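
A tiny illustrative sketch of that stopword filtering, shown in Python with NLTK's stopword list rather than the Java pipeline above; this is just one assumed way to do it:

# Drop common stopwords from a list of lemmas before feeding a classifier.
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

stop_words = set(stopwords.words('english'))
lemmas = ["the", "cat", "be", "run", "in", "the", "garden"]
print([lemma for lemma in lemmas if lemma not in stop_words])  # ['cat', 'run', 'garden']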

何必那么认真
Answer #6 · 2019-01-03 20:25

The top Python packages (in no specific order) for lemmatization are: spacy, nltk, gensim, pattern, CoreNLP and TextBlob. I prefer spaCy and gensim's implementation (based on pattern) because they identify the POS tag of the word and assign the appropriate lemma automatically. This gives more relevant lemmas, keeping the meaning intact.

If you plan to use nltk or TextBlob, you need to take care of finding the right POS tag manually and then look up the right lemma for it (a small sketch of this follows).
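
As an illustration of that manual step, here is a short sketch (not from the original answer) using nltk's pos_tag together with WordNetLemmatizer; the Penn-to-WordNet tag mapping below is a common convention, assumed rather than taken from the answer:

# Tag each word, map the Penn Treebank tag to a WordNet POS,
# then lemmatize with that POS so verbs and nouns resolve correctly.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default when unsure

lemmatizer = WordNetLemmatizer()
words = "cats running ran cactus cactuses cacti community communities".split()
tagged = nltk.pos_tag(words)  # requires the averaged_perceptron_tagger and wordnet data
print([lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in tagged])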

Lemmatization Example with spaCy:

# Run below statements in terminal once. 
pip install spacy
spacy download en

import spacy

# Initialize spacy 'en' model
nlp = spacy.load('en', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Parse
doc = nlp(sentence)

# Extract the lemma
" ".join([token.lemma_ for token in doc])
#> 'the strip bat be hang on -PRON- foot for good'

Lemmatization Example With Gensim:

from gensim.utils import lemmatize
sentence = "The striped bats were hanging on their feet and ate best fishes"
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
#> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']

The above examples were borrowed from this lemmatization page.

你好瞎i
Answer #7 · 2019-01-03 20:26

You could use the Morpha stemmer. UW has uploaded the Morpha stemmer to Maven Central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are thread-safe (the original JFlex code had class fields for local variables unnecessarily). Instantiate the class and call morpha with the word you want to stem.

new MorphaStemmer().morpha("climbed") // goes to "climb"