-->

Java Lucene Stop Words Filter

2019-08-20 08:50发布

问题:

I have about 500 sentences in which I would like to compile a set of ngrams. I am having trouble removing the stop words. I tried adding the lucene StandardFilter and StopFilter but I still have the same problem. Here is my code:

for(String curS: Sentences)
{
          reader = new StringReader(curS);
          tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
          tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
          tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
          tokenizer = new ShingleFilter(tokenizer, 2, 3);
          charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

    while(tokenizer.incrementToken())
    {
        curNGram = charTermAttribute.toString().toString();
        nGrams.add(curNGram);                   //store each token into an ArrayList
    }
}

For example, the first phrase I am testing is: "For every person that listens to". In this example curNGram is set to "For" which is a stop word in my list stopWords. Also, in this example "every" is a stop word and so "person" should be the first ngram.

  1. Why are stop words being added to my list when I am using the StopFiler?

All help is appreciated!

回答1:

What you've posted looks okay to me, so I suspect that stopWords isn't providing the information you want to the filter.

Try something like:

//Let's say we read the stop words into an array list (A simple array, or any list implementation should be fine)
List<String> words = new ArrayList();
//Read the file into words.
Set stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true);

Assuming the list you of stopwords you generated (the one I've named 'words') looks like you think it does, this should put them into a format usable to the StopFilter.

Were you already generating stopWords like that?