Lucene search and underscores

Posted 2019-01-25 11:11

When I use Luke to search my Lucene index with the standard analyzer, I can see that the field I am searching contains values of the form MY_VALUE. However, when I search for field:"MY_VALUE", the query is parsed as field:"my value".

Is there a simple way to escape the underscore (_) character so that the query matches it literally?

EDIT:

4/1/2010 11:08AM PST

I think there is a bug in the tokenizer in Lucene 2.9.1, and it was probably there in earlier versions too. Load up Luke and try to search for "BB_HHH_FFFF5_SSSS". When the value contains a number, the following tokens are returned:

"bb hhh_ffff5_ssss"

After some testing, I've found that this is because of the number. If I input

"BB_HHH_FFFF_SSSS", I get

"bb hhh ffff ssss"

At this point, I'm leaning towards a tokenizer bug, unless the presence of a number is supposed to cause this behavior, and I fail to see why it would.

Can anyone confirm this?

2 Answers
走好不送
#2 · 2019-01-25 11:50

I don't think you'll be able to use the standard analyser for this use case.

Judging by what I think your requirements are, the keyword analyser should work with little effort (the whole field value becomes a single term).

I think some of the confusion arises from looking at the field with Luke. The stored value is not what queries match against; what matters are the indexed terms. I suspect that when you look at the terms stored for your field, they'll be "my" and "value".
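To make the term splitting concrete, here is a rough sketch in plain Java. It is not Lucene code. It is a simplified model of how, as far as I recall, the classic StandardTokenizer grammar treats underscores: alphanumeric runs joined by one of _ - / . , stay together only when every other run in the chain contains a digit. The class name ClassicTokenSketch and the method tokenize are made up for illustration, and the model ignores stop words, e-mail addresses, host names, acronyms and apostrophes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified model of the behaviour discussed above; NOT Lucene code.
final class ClassicTokenSketch {

    // Characters assumed to be able to join segments into one token.
    private static boolean isJoiner(char c) {
        return "_-/.,".indexOf(c) >= 0;
    }

    private static boolean hasDigit(String s) {
        return s.chars().anyMatch(Character::isDigit);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators between tokens
                continue;
            }
            // Collect the chain of alphanumeric segments joined by single joiners.
            List<int[]> segs = new ArrayList<>(); // each entry is {start, end}
            int pos = i;
            while (true) {
                int end = pos;
                while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
                    end++;
                }
                segs.add(new int[] {pos, end});
                boolean joined = end + 1 < text.length()
                        && isJoiner(text.charAt(end))
                        && Character.isLetterOrDigit(text.charAt(end + 1));
                if (!joined) {
                    break;
                }
                pos = end + 1;
            }
            // Longest prefix of at least two segments in which either all
            // even-indexed or all odd-indexed segments contain a digit.
            int best = 1;
            boolean evenOk = true;
            boolean oddOk = true;
            for (int n = 1; n <= segs.size(); n++) {
                int[] s = segs.get(n - 1);
                boolean digit = hasDigit(text.substring(s[0], s[1]));
                if ((n - 1) % 2 == 0) {
                    evenOk &= digit;
                } else {
                    oddOk &= digit;
                }
                if (n >= 2 && (evenOk || oddOk)) {
                    best = n;
                }
            }
            int tokenEnd = segs.get(best - 1)[1];
            // The standard analyzer also lowercases its tokens.
            tokens.add(text.substring(i, tokenEnd).toLowerCase(Locale.ROOT));
            i = tokenEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("MY_VALUE"));          // [my, value]
        System.out.println(tokenize("BB_HHH_FFFF5_SSSS")); // [bb, hhh_ffff5_ssss]
        System.out.println(tokenize("BB_HHH_FFFF_SSSS"));  // [bb, hhh, ffff, ssss]
    }
}
```

If this model is right, the output with the number in it is the tokenizer trying to keep things like part numbers in one piece rather than a bug, but check the terms in your own index before relying on that.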

Hope this helps,

Bombasti
#3 · 2019-01-25 12:00

It doesn't look like you used the StandardAnalyzer to index that field. In Luke you'll need to select the analyzer that you used to index that field in order to match MY_VALUE correctly.

Incidentally, you might be able to match MY_VALUE by using the KeywordAnalyzer.
