Using CharFilter with Lucene 4.3.0's StandardA

I am trying to add a CharFilter to my StandardAnalyzer. My intention is to strip out punctuation from all the text I index; for example I want a PrefixQuery "pf" to match "P.F. Chang's" or "zaras" to match "Zara's".

It seems that the easiest plan of attack here is to filter out all punctuation before analysis. Per the Analyzer package documentation, that means I should use a CharFilter.

However, it seems next to impossible to actually insert a CharFilter into the analyzer!

The JavaDoc for Analyzer.initReader says "Override this if you want to insert a CharFilter".

If my code extends Analyzer, I can override initReader, but I cannot delegate the abstract createComponents to my base StandardAnalyzer, because it is protected. I cannot delegate tokenStream to my base analyzer either, because it is final. So a subclass of Analyzer seemingly cannot use another Analyzer to do its dirty work.

There is an AnalyzerWrapper class that seems perfect for what I want! I can provide a base analyzer and only override the pieces that I want. Except … initReader is overridden already to delegate to the base analyzer, and this override is "final"! Bummer!

I guess I could put my Analyzer in the org.apache.lucene.analysis package so that it can access the protected createComponents method, but this seems like a disgustingly hacky way to bypass the public API that I really should use.

Am I missing something glaring here? How can I amend a StandardAnalyzer to use a custom CharFilter?

Tags: java lucene

1 answer

乱世女痞

#2 · 2019-02-23 21:56

The intent is for you to extend Analyzer rather than StandardAnalyzer. The thinking is that you should never subclass an Analyzer implementation (there has been some discussion of that). Analyzer implementations are pretty straightforward, though, and adding a CharFilter to an Analyzer that implements the same tokenizer/filter chain as StandardAnalyzer would look something like:

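import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class MyAnalyzer extends Analyzer {
    // Assumption: matchVersion was undefined in the original snippet; 4.3 matches the question.
    private final Version matchVersion = Version.LUCENE_43;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Return your CharFilter-wrapped reader here; returning reader unchanged applies no filtering.
        return reader;
    }
}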
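
As for what initReader might return, here is a minimal sketch using only the JDK, so it can be tried without Lucene on the classpath. PunctuationStrippingReader and its readAll helper are hypothetical names, not Lucene classes, and "punctuation" is assumed to mean any character that is not a letter, digit, or whitespace. It is a plain Reader, not a real CharFilter, so it does not do the offset correction a CharFilter provides.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper (not part of Lucene): drops punctuation from a character stream. */
public class PunctuationStrippingReader extends FilterReader {

    public PunctuationStrippingReader(Reader in) {
        super(in);
    }

    /**
     * Assumption: keep letters, digits and whitespace; drop everything else.
     * Surrogate code units are kept so supplementary characters pass through untouched.
     */
    private static boolean keep(int c) {
        return Character.isLetterOrDigit(c)
                || Character.isWhitespace(c)
                || Character.isSurrogate((char) c);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip characters until one should be kept or the stream ends.
        while ((c = in.read()) != -1 && !keep(c)) {
            // discard this character
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        // Reader contract: -1 signals end of stream, otherwise the number of chars read.
        return n == 0 ? -1 : n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skip in terms of filtered characters so skip() stays consistent with read().
        long skipped = 0;
        while (skipped < n && read() != -1) {
            skipped++;
        }
        return skipped;
    }

    /** Demo helper: reads the whole stream into a String. */
    public static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        int n;
        while ((n = r.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // prints "PF Changs"
        System.out.println(readAll(new PunctuationStrippingReader(new StringReader("P.F. Chang's"))));
        // prints "Zaras"
        System.out.println(readAll(new PunctuationStrippingReader(new StringReader("Zara's"))));
    }
}
```

With something like this, initReader could be written as return new PunctuationStrippingReader(reader);. Because it is not a CharFilter, token offsets would refer to the filtered text rather than the original, which matters if you highlight. If that is a concern, the CharFilter implementations that ship with Lucene's analysis module, such as PatternReplaceCharFilter or MappingCharFilter, are probably the better starting point.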
