Lucene StandardAnalyzer: handling hyphenated words

Posted 2019-09-09 09:01

Question:

While indexing my document with Lucene's StandardAnalyzer, I ran into a problem.

For example, my document contains the word "plag-iarism". The analyzer indexed it as "plag" and "iarism", but I want it indexed as "plagiarism". What do I have to do to get the whole word?

Answer 1:

StandardAnalyzer delegates tokenization to StandardTokenizer. You can create your own tokenizer to match your exact needs (you could base it on StandardTokenizer).

Alternatively, if you prefer, you could do a dirty hack: call String.replaceAll() with the relevant regular expression on the text just before the analyzer runs. Yeah. Ugly.
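
A minimal sketch of that pre-processing hack, using only the JDK. The class name HyphenJoiner and the method joinHyphenated are made up for illustration, and the regular expression is an assumption about what "relevant" means here: it removes a hyphen only when there is a letter on both sides. You would call it on the raw text before handing that text to the analyzer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class HyphenJoiner {
    // A hyphen with a letter directly before and after it, e.g. the one in "plag-iarism".
    private static final Pattern INNER_HYPHEN =
            Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    // Returns the text with every letter-hyphen-letter hyphen removed.
    static String joinHyphenated(String text) {
        return INNER_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Clean the text first, then pass the result to the analyzer.
        System.out.println(joinHyphenated("a case of plag-iarism"));
    }
}
```

Note that this also joins genuine compounds, so "well-known" becomes "wellknown". If that matters for your documents, the pattern would need to be narrowed.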



Tags: lucene