Java Lucene Stop Words Filter

Posted 2019-08-20 08:57

I have about 500 sentences from which I would like to compile a set of n-grams. I am having trouble removing the stop words. I tried adding the Lucene StandardFilter and StopFilter, but stop words still show up in the output. Here is my code:

for (String curS : Sentences)
{
    reader = new StringReader(curS);
    // Build the analysis chain: tokenize, normalize, drop stop words, then form 2- and 3-grams.
    // Note: the filters are TokenStreams, so `tokenizer` must be declared as TokenStream.
    tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
    tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

    tokenizer.reset();
    while (tokenizer.incrementToken())
    {
        curNGram = charTermAttribute.toString();
        nGrams.add(curNGram);   // store each token in an ArrayList
    }
    tokenizer.end();
    tokenizer.close();
}

For example, the first phrase I am testing is "For every person that listens to". Here curNGram is first set to "For", which is a stop word in my list stopWords. "every" is also a stop word, so "person" should be the first n-gram.

Why are stop words being added to my list when I am using the StopFilter?

All help is appreciated!

Tags: java filter lucene words

1 answer

走好不送

Reply #2 · 2019-08-20 09:23

What you've posted looks okay to me, so I suspect that stopWords isn't providing the information you want to the filter.

Try something like:

// Say we read the stop words into a list (a plain array or any List implementation should also work).
List<String> words = new ArrayList<String>();
// Read the file into words.
// The third argument (ignoreCase = true) makes stop word matching case-insensitive.
Set<?> stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true);

Assuming the list of stop words you generated (the one I've named 'words') looks the way you think it does, this should put them into a format the StopFilter can use.
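One thing that may be worth checking is case. As far as I know, StandardTokenizer does not lowercase tokens, so "For" would not match a lowercase "for" entry in a case-sensitive stop set. The sketch below is plain Java rather than Lucene code, and it only illustrates the lookup behavior. The class name, the removeStopWords helper, and the stop words other than "For" and "every" are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordCaseSketch {

    // Hypothetical helper (not a Lucene API): drops every token found in stopWords.
    // When ignoreCase is true, each token is lowercased before the lookup,
    // which assumes the stop set itself holds lowercase entries.
    public static List<String> removeStopWords(List<String> tokens,
                                               Set<String> stopWords,
                                               boolean ignoreCase) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!stopWords.contains(key)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed stop list: the question only states that "For" and "every" are on it.
        Set<String> stopWords =
                new HashSet<String>(Arrays.asList("for", "every", "that", "to"));
        List<String> tokens =
                Arrays.asList("For", "every", "person", "that", "listens", "to");

        // Case-sensitive lookup: "For" survives because only "for" is in the set.
        System.out.println(removeStopWords(tokens, stopWords, false)); // [For, person, listens]

        // Case-insensitive lookup: "For" is removed as well.
        System.out.println(removeStopWords(tokens, stopWords, true));  // [person, listens]
    }
}
```

If this is what is happening in your case, building the set with ignoreCase set to true, or lowercasing the tokens before the StopFilter, should make "For" drop out as expected.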

Were you already generating stopWords like that?
