How to add custom stop words using Lucene in Java

I am using Lucene to remove English stop words, but I need to remove both the English stop words and my own custom stop words. Below is my code, which currently removes only the English stop words.

My Sample Code:

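import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Stopwords_remove {
    public String removeStopWords(String string) throws IOException
    {
        // Tokenize the input, then drop Lucene's default English stop words.
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        StringBuilder sb = new StringBuilder();
        tokenStream.reset(); // prepare the stream before consuming tokens
        while (tokenStream.incrementToken())
        {
            if (sb.length() > 0)
            {
                sb.append(" ");
            }
            sb.append(token.toString());
        }
        tokenStream.end();
        tokenStream.close();
        return sb.toString();
    }

    public static void main(String args[]) throws IOException
    {
        String text = "this is a java project written by james.";
        Stopwords_remove stopwords = new Stopwords_remove();
        System.out.println(stopwords.removeStopWords(text));
    }
}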

output: java project written james.

required output: java project james.

How can I do this?

Tags: java lucene stop-words

1 answer

再贱就再见

Reply #2 · 2020-07-18 06:14

You could add your additional stop words to a copy of the standard English stop word set, or just chain a second StopFilter after the first one. For example:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
// Copy the default English set so the shared constant is not modified.
CharArraySet stopSet = CharArraySet.copy(Version.LUCENE_36, StandardAnalyzer.STOP_WORDS_SET);
stopSet.add("add");
stopSet.add("your");
stopSet.add("stop");
stopSet.add("words");
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stopSet);
// Or, if you just need the added stop words in a StandardAnalyzer, you could pass this stop set to the StandardAnalyzer constructor:
// Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, stopSet);

or:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
// First filter: the default English stop words.
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
// Second filter: your own list of stop words (placeholder values shown here).
List<String> stopWords = Arrays.asList("add", "your", "stop", "words");
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StopFilter.makeStopSet(Version.LUCENE_36, stopWords));
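The two variants above should produce the same tokens, because filtering once against the union of two sets removes the same words as filtering twice in a row. Here is a minimal plain-Java sketch of that idea. It does not use Lucene: the class StopWordSketch, its methods, the tiny ENGLISH set, and the regex tokenizer are all hypothetical stand-ins for StandardAnalyzer.STOP_WORDS_SET, StandardTokenizer, and StopFilter, chosen only so the example runs without any dependency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // Assumption: a tiny stand-in for the default English stop words,
    // containing only what the sample sentence needs.
    public static final Set<String> ENGLISH =
            new HashSet<>(Arrays.asList("this", "is", "a", "by"));

    // Crude stand-in for a tokenizer: split on anything that is not a letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // One filtering pass: keep only the tokens that are not in the stop set.
    public static List<String> filter(List<String> tokens, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopSet.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Variant 1: copy the default set, add the custom words, filter once.
    public static String mergedSet(String text, Set<String> custom) {
        Set<String> merged = new HashSet<>(ENGLISH);
        merged.addAll(custom);
        return String.join(" ", filter(tokenize(text), merged));
    }

    // Variant 2: filter with the default set, then filter again with the custom set.
    public static String chained(String text, Set<String> custom) {
        return String.join(" ", filter(filter(tokenize(text), ENGLISH), custom));
    }

    public static void main(String[] args) {
        Set<String> custom = new HashSet<>(Arrays.asList("written"));
        String text = "this is a java project written by james.";
        System.out.println(mergedSet(text, custom)); // prints "java project james"
        System.out.println(chained(text, custom));   // prints "java project james"
    }
}
```

With Lucene itself, the merged-set variant is usually the more convenient one when you also want to hand the set to a StandardAnalyzer, while the chained variant avoids copying the default set.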

If you are trying to create your own Analyzer, you might be better served following a pattern more like the example in the Analyzer documentation.
