Stemming English words with Lucene

I'm processing some English texts in a Java application, and I need to stem them. For example, from the text "amenities/amenity" I need to get "amenit".

The function looks like:

String stemTerm(String term){
   ...
}

I've found the Lucene Analyzer, but it looks way too complicated for what I need. http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/PorterStemFilter.html

Is there a way to use it to stem words without building an Analyzer? I don't understand all the Analyzer business...

EDIT: I actually need stemming + lemmatization. Can Lucene do this?

Tags: java lucene stemming porter-stemmer

6 Answers

乱世女痞

Reply #2 · 2019-01-10 13:23

Why aren't you using the "EnglishAnalyzer"? It's simple to use, and I think it would solve your problem:

EnglishAnalyzer en_an = new EnglishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "your_field", en_an);
String str = "amenities";
System.out.println("result: " + parser.parse(str)); //amenit

Hope it helps you!


够拽才男人

Reply #3 · 2019-01-10 13:26

The previous example applies stemming to a search query, so if you are interested in stemming a full text, you can try the following:

import java.io.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.snowball.*;
import org.apache.lucene.util.*;
...
public class Stemmer{
    public static String Stem(String text, String language){
        StringBuffer result = new StringBuffer();
        if (text!=null && text.trim().length()>0){
            StringReader tReader = new StringReader(text);
            Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_35,language);
            TokenStream tStream = analyzer.tokenStream("contents", tReader);
            TermAttribute term = tStream.addAttribute(TermAttribute.class);

            try {
                while (tStream.incrementToken()){
                    result.append(term.term());
                    result.append(" ");
                }
            } catch (IOException ioe){
                System.out.println("Error: "+ioe.getMessage());
            }
        }

        // If, for some reason, the stemming did not happen, return the original text
        if (result.length()==0)
            result.append(text);
        return result.toString().trim();
    }

    public static void main (String[] args){
        // Print the stemmed text so the result is visible
        System.out.println(Stemmer.Stem("Michele Bachmann amenities pressed her allegations that the former head of her Iowa presidential bid was bribed by the campaign of rival Ron Paul to endorse him, even as one of her own aides denied the charge.", "English"));
    }
}

The TermAttribute class has been deprecated and will no longer be supported in Lucene 4, but the documentation is not clear on what to use in its place.

Also, in the first example, PorterStemmer is not available as a public class (it is hidden), so you cannot use it directly.

Hope this helps.


The star"

Reply #4 · 2019-01-10 13:29

SnowballAnalyzer is deprecated; you can use the Lucene Porter Stemmer instead:

 PorterStemmer stem = new PorterStemmer();
 stem.setCurrent(word);
 stem.stem();
 String result = stem.getCurrent();

Hope this helps!


家丑人穷心不美

Reply #5 · 2019-01-10 13:32

Here is how you can use the Snowball stemmer in Java:

import org.tartarus.snowball.ext.EnglishStemmer;

EnglishStemmer english = new EnglishStemmer();
String[] words = tokenizer("bank banker banking");
for(int i = 0; i < words.length; i++){
        english.setCurrent(words[i]);
        english.stem();
        System.out.println(english.getCurrent());
}
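
The snippet above calls a tokenizer helper that is not shown. A minimal sketch of what it might look like, assuming it only splits the input on whitespace (the class name SimpleTokenizer and this behavior are assumptions for illustration, not part of the Snowball library):

```java
import java.util.Arrays;

// Hypothetical helper: the original answer does not show its tokenizer.
final class SimpleTokenizer {
    private SimpleTokenizer() {}

    // Splits text on runs of whitespace; returns an empty array for null or blank input.
    static String[] tokenizer(String text) {
        if (text == null || text.trim().isEmpty()) {
            return new String[0];
        }
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // prints [bank, banker, banking]
        System.out.println(Arrays.toString(tokenizer("bank banker banking")));
    }
}
```

Real text usually also needs punctuation stripped and case normalized before stemming, which this sketch deliberately leaves out.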


祖国的老花朵

Reply #6 · 2019-01-10 13:38

import org.apache.lucene.analysis.PorterStemmer;
...
String stemTerm (String term) {
    PorterStemmer stemmer = new PorterStemmer();
    return stemmer.stem(term);
}

See here for more details. If stemming is all you want to do, then you should use this instead of Lucene.

Edit: You should lowercase term before passing it to stem().
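
The lowercasing advice can be sketched as a small wrapper. This is only an illustration: NormalizingStemmer and stemNormalized are hypothetical names, and the stemmer is passed in as a plain function so the example does not depend on any particular Lucene version.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical wrapper: lowercases a term before handing it to any stemmer.
final class NormalizingStemmer {
    private NormalizingStemmer() {}

    // 'stemmer' stands in for a real stemming call such as stem(term).
    static String stemNormalized(String term, UnaryOperator<String> stemmer) {
        if (term == null) {
            return null;
        }
        // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
        String normalized = term.trim().toLowerCase(Locale.ROOT);
        return stemmer.apply(normalized);
    }

    public static void main(String[] args) {
        // Placeholder stemmer that only strips a trailing "s"; NOT the Porter algorithm.
        UnaryOperator<String> toyStemmer =
                s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;
        System.out.println(stemNormalized("  Banks ", toyStemmer)); // prints "bank"
    }
}
```

In real use, the lambda would be replaced by a call to whichever stemmer is on the classpath.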


狗以群分

Reply #7 · 2019-01-10 13:40

LingPipe provides a number of tokenizers. They can be used for stemming and stop-word removal. It's a simple and effective means of stemming.
