Lucene: best way to do “starts-with” queries

Posted 2019-05-10 12:55

Question:

I want to be able to do the following types of queries:

The data to index consists of (let's say) music videos, where only the title is interesting. I simply want to index these and then build queries such that, whatever word or words the user typed, the documents containing those words, in that order, at the beginning of the title are returned first, followed (in no particular order) by documents containing at least one of the searched words at any position in the title. All of this should be case insensitive.

Example:

For documents:

  • Video1Title = Sea is blue
  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever
  • Video4Title = Seaside Whatever

If I search "sea" I want to get

  • "Video1Title = Sea is blue"

first followed by all the other documents that contain "sea" in title, but not at the beginning.

If I search "Wild sea" I want to get

  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever

first followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as title prefix.

If I search "Seasi" I don't wanna get anything (I don't care for Keyword Tokenization and prefix queries).
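The expected ordering in the examples above can be pinned down with a small reference model. This is only a sketch in plain Java, without Lucene, and the class and method names (`StartsWithRanking`, `search`) are made up for illustration. It assumes whitespace tokenization and lowercasing, which is roughly what the question asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical reference model of the desired ordering; it does not use Lucene.
final class StartsWithRanking {

    // Lowercase and split on whitespace (assumed stand-in for a simple analyzer).
    static List<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // True when the title's first tokens equal the query tokens, in order.
    static boolean startsWith(List<String> title, List<String> query) {
        return !query.isEmpty()
                && title.size() >= query.size()
                && title.subList(0, query.size()).equals(query);
    }

    // True when at least one query token occurs as a whole token in the title.
    static boolean containsAny(List<String> title, List<String> query) {
        for (String q : query) {
            if (title.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // Prefix matches first, then the remaining matches; non-matches are dropped.
    static List<String> search(List<String> titles, String query) {
        List<String> q = tokens(query);
        List<String> prefixHits = new ArrayList<>();
        List<String> otherHits = new ArrayList<>();
        for (String title : titles) {
            List<String> t = tokens(title);
            if (startsWith(t, q)) {
                prefixHits.add(title);
            } else if (containsAny(t, q)) {
                otherHits.add(title);
            }
        }
        prefixHits.addAll(otherHits);
        return prefixHits;
    }
}
```

Because matching is done on whole tokens, "Seasi" does not match "Seaside Whatever", which is the behaviour asked for.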

Now AFAIKS, there's no actual way to tell Lucene "find me documents where word1 and word2 and etc. are in positions 1 and 2 and 3 and etc."

There are "workarounds" to simulate that behaviour:

  • Index the field twice. In field1 the words are tokenized (using perhaps StandardAnalyzer) and in field2 the whole title is kept as a single term (using KeywordAnalyzer). Then search with something like:

    +field1:(word1 word2 word3) field2:word1\ word2\ word3*

This effectively tells Lucene: "documents must contain word1 or word2 or word3 in the title, and those whose title starts with >word1 word2 word3< are better (get a higher score)".

  • Add a "lucene_start_token" to the beginning of the field when indexing, such that Video2Title = Wild sea is indexed as "title:lucene_start_token Wild sea", and so on for the rest.

Then run a query such as:

    +(title:sea) (title:"lucene_start_token sea")

This makes Lucene return all documents that contain my search word(s) in the title, and give a better score to those that also match "lucene_start_token" followed by the search words.
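The start-token workaround can be sketched as two string helpers: one that produces the value to index and one that produces the query text. This is a minimal sketch with no Lucene dependency; the class `StartTokenQuery` and its methods are hypothetical names, and it assumes the output is fed to Lucene's classic query parser with an analyzer that lowercases terms and keeps "lucene_start_token" as a single token.

```java
import java.util.Locale;

// Sketch only: builds strings in classic query parser syntax, no Lucene classes involved.
final class StartTokenQuery {

    // Marker name taken from the question; it must never occur in a real title.
    static final String START = "lucene_start_token";

    // Value to put into the analyzed "title" field at index time.
    static String indexValue(String title) {
        return START + " " + title.trim();
    }

    // Crude sanitizing (an assumption, not a full escaper): keep letters, digits, spaces.
    // Lowercasing also prevents AND/OR/NOT from being read as operators.
    static String sanitize(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    // Required clause: any of the words. Optional clause: the phrase anchored at the marker.
    static String query(String userInput) {
        String words = sanitize(userInput);
        if (words.isEmpty()) {
            throw new IllegalArgumentException("query has no searchable words");
        }
        return "+title:(" + words + ") title:\"" + START + " " + words + "\"";
    }
}
```

For example, `StartTokenQuery.query("Wild Sea")` yields `+title:(wild sea) title:"lucene_start_token wild sea"`. The required clause decides which documents match at all, and the optional phrase clause only raises the score of titles that begin with the searched words.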

My question is then: are there better ways to do this (maybe using PhraseQuery and term positions)? If not, which of the above is better performance-wise?

Answer 1:

You can use Lucene Payloads for that. You can give custom boost for every term of the field value.

So, when you index your titles you can start with a boost factor of 3 (for example) for the first term and lower it for each following term:

title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5

title: sea|3.0 creatures|2.5

Indexed this way, the terms nearest to the start of the title get the highest boost.

The main problem with this approach is that you have to tokenize the text yourself and add all this boost information "manually", as the Analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).
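The manual step described above could look roughly like this. It is a sketch in plain Java: `PayloadTitleEncoder` and its parameters (`startBoost`, `step`, `minBoost`) are hypothetical, and it only produces the `term|boost` text. It assumes the field is then analyzed by something that understands that format (Lucene ships a `DelimitedPayloadTokenFilter` for this kind of input) and that a payload-aware query is used at search time, since payloads do not change scores on their own.

```java
import java.util.Locale;

// Sketch only: turns a title into "term|boost" text with boosts decreasing from the start.
final class PayloadTitleEncoder {

    // startBoost, step and minBoost are illustrative knobs, not Lucene settings.
    static String encode(String title, double startBoost, double step, double minBoost) {
        // Assumption: whitespace tokenization and lowercasing are good enough for titles.
        String[] terms = title.trim().toLowerCase(Locale.ROOT).split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < terms.length; i++) {
            // Earlier positions get larger boosts; never drop below minBoost.
            double boost = Math.max(minBoost, startBoost - i * step);
            if (i > 0) {
                out.append(' ');
            }
            out.append(terms[i]).append('|').append(boost);
        }
        return out.toString();
    }
}
```

With `encode("Wild creatures blue sea", 3.0, 0.5, 1.0)` this reproduces the first example line, `wild|3.0 creatures|2.5 blue|2.0 sea|1.5`. The floor keeps long titles from ending up with zero or negative boosts.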



Answer 2:

What you could do is index the title and each token separately, e.g. the text "wild deep blue endless sea" would be indexed like:

title: wild deep blue endless sea
t1: wild
t2: deep
t3: blue
t4: endless
t5: sea

Then if someone queries "wild deep", the query would be rewritten into

title:"wild deep" OR (t1:wild AND t2:deep)

This way you will always find all matching documents (if they match title) but matching t1..tN tokens will score the relevant documents higher.
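A sketch of the per-position fields and the query rewrite, again in plain Java without Lucene. `PositionalFields`, `fields` and `rewrite` are hypothetical names, and whitespace tokenization with lowercasing is assumed. The rewritten string follows the form shown above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: derives title/t1..tN field values and the rewritten query text.
final class PositionalFields {

    // Assumption: whitespace tokenization plus lowercasing.
    static String[] tokens(String text) {
        return text.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    // Field name -> value, in insertion order: title first, then t1, t2, ...
    static Map<String, String> fields(String title) {
        String[] t = tokens(title);
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", String.join(" ", t));
        for (int i = 0; i < t.length; i++) {
            doc.put("t" + (i + 1), t[i]);
        }
        return doc;
    }

    // Phrase on the full title OR an exact match of each word at its position.
    static String rewrite(String userQuery) {
        String[] t = tokens(userQuery);
        StringBuilder positional = new StringBuilder();
        for (int i = 0; i < t.length; i++) {
            if (i > 0) {
                positional.append(" AND ");
            }
            positional.append('t').append(i + 1).append(':').append(t[i]);
        }
        return "title:\"" + String.join(" ", t) + "\" OR (" + positional + ")";
    }
}
```

`PositionalFields.rewrite("wild deep")` gives `title:"wild deep" OR (t1:wild AND t2:deep)`. Note that the phrase clause only matches titles containing the words next to each other, so returning documents that contain just one of the words, as the question asks, would need the title clause relaxed to a plain term match.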