Lucene StandardAnalyzer Hyphen Consideration

Published 2019-09-09 09:01

做个烂人

Question:

While indexing my document with Lucene's StandardAnalyzer, I ran into a problem.

For example, my document contains the word "plag-iarism", and the analyzer indexes it as two tokens, "plag" and "iarism". I want it indexed as the single word "plagiarism". What do I have to do to get the whole word?

Answer 1:

StandardAnalyzer delegates tokenization to StandardTokenizer, which splits on the hyphen. You can create your own tokenizer to match your exact needs (you could base it on StandardTokenizer).

Alternatively, if you prefer, you could do a dirty hack: run String.replaceAll() with the relevant regular expression on the text just before the analyzer runs, so the hyphen is gone by the time the tokenizer sees it. Yeah. Ugly.
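
A minimal sketch of that preprocessing hack, using only the JDK and no Lucene classes. The class name HyphenJoiner and the method joinHyphenated are made up for illustration, and the regular expression is an assumption about what "relevant" means here: it removes only a hyphen that sits directly between two letters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: strips intra-word hyphens
// from the text before it is handed to the analyzer.
final class HyphenJoiner {
    // A hyphen with a letter immediately before and immediately after it.
    private static final Pattern INTRA_WORD_HYPHEN =
            Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    static String joinHyphenated(String text) {
        // Delete every matching hyphen; all other characters are untouched.
        return INTRA_WORD_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "a case of plagiarism"
        System.out.println(joinHyphenated("a case of plag-iarism"));
    }
}
```

You would pass the returned string to the field you index instead of the raw text. Note that this also joins genuine compounds ("well-known" becomes "wellknown"), and the same replacement has to be applied to query text, otherwise a search for "plag-iarism" will not match the indexed "plagiarism".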

Tags: lucene