I want to be able to do the following types of queries:
The data to index consists of (let's say), music videos where only the title is interesting. I simply want to index these and then create queries for them such that, whatever word or words the user used in the query, the documents containing those words, in that order, at the beginning of the tile will be returned first, followed (in no particular order) by documents containing at least one of the searched words in any position of the title. Also all this should be case insensitive.
Example:
For documents:
- Video1Title = Sea is blue
- Video2Title = Wild sea
- Video3Title = Wild sea Whatever
- Video4Title = Seaside Whatever
If I search "sea" I want to get
- "Video1Title = Sea is blue"
first followed by all the other documents that contain "sea" in title, but not at the beginning.
If I search "Wild sea" I want to get
- Video2Title = Wild sea
- Video3Title = Wild sea Whatever
first followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as title prefix.
If I search "Seasi" I don't wanna get anything (I don't care for Keyword Tokenization and prefix queries).
Now AFAIKS, there's no actual way to tell Lucene "find me documents where word1 and word2 and etc. are in positions 1 and 2 and 3 and etc."
There are "workarounds" to simulate that behaviour:
Index the field twice. In
field1
you have the words tokenized (using perhapsStandardAnalyzer
) and infield2
you have them all clumped up into one element (usingKeywordAnalyzer
). Then if you search something like :+(field1:word1 word2 word3) (field2:"word1 word2 word3*")
effectively telling Lucene "Documents must contain word1 or word2 or word3 in the title, and furthermore those that match "title starts with >word1 word2 word3<" are better (get higher score).
- Add a "lucene_start_token" to the beginning of the field when indexing them such that
Video2Title = Wild sea
is indexed as "title:lucene_start_token Wild sea
" and so on for the rest
Then do a query such that:
+(title:sea) (title:"lucene_start_token sea")
and having Lucene return all documents which contain my search word(s) in the title and also give a better score on those who matched "lucene_start_token+search words"
My question is then, are there indeed better ways to do this (maybe using PhraseQuery and Term position)? If not, which of the above is better perfromance-wise?
What you could do is index the title and each token separately, e.g. text
wild deep blue endless sea
would be indexed like:Then if someone queries "wild deep", the query would be rewritten into
This way you will always find all matching documents (if they match
title
) but matchingt1..tN
tokens will score the relevant documents higher.You can use Lucene Payloads for that. You can give custom boost for every term of the field value.
So, when you index your titles you can start using a boost factor of 3 (for example):
title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5
title: sea|3.0 creatures|2.5
Indexing this way you are boosting nearest terms to the start of title.
The main problem using this approach is you have to tokenize by yourself and add all this boost information "manually" as the Analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).