I have a text field.
For a given query, I want to find all documents whose indexed field value is contained in the query:
query.contains(document.field_name)
Examples:
1. field_name:"a b"
2. field_name:"a b c"
For the query "a b d", I want to find only the first item.
An inefficient way to do this is to generate all substrings of the query and index the field as a single string.
Is it possible to implement this requirement in Solr using existing functionality?
If not, what is the most efficient algorithm or approach?
PS. It seems Google AdWords does this kind of matching to find ads.
I think it might be difficult to do this in a single Solr query. If I have understood your question correctly, I would tokenize the query string, search for each token in turn, and then compare the search results with the initial query string. For example, suppose your query string is "term1 term2 term3". You would search for each of these terms in turn:
/solr/index/select?q=term1
This might return the following:
term1 term2 term4
term1 term2
term1 term2 term3
You could then run a comparison against your initial query ("term1 term2 term3") to see if it contains each search result. Apologies if the above isn't helpful.
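The comparison step could look something like the sketch below. It is only an illustration: it assumes plain whitespace tokenization and treats "contains" as "the field value's tokens appear contiguously and in order inside the query". The helper names (tokens, contains_phrase, filter_candidates) are made up for this example and are not part of Solr.

```python
def tokens(text):
    """Split on whitespace; a stand-in for whatever tokenization the index uses."""
    return text.split()


def contains_phrase(query, value):
    """Return True if the tokens of `value` appear contiguously, in order, in `query`."""
    q, v = tokens(query), tokens(value)
    if not v:
        return False
    # Slide a window of len(v) tokens over the query and compare.
    return any(q[i:i + len(v)] == v for i in range(len(q) - len(v) + 1))


def filter_candidates(query, candidates):
    """Keep only the candidate field values that the query fully contains."""
    return [c for c in candidates if contains_phrase(query, c)]


query = "term1 term2 term3"
# Field values returned by the per-term searches (the example results above).
candidates = ["term1 term2 term4", "term1 term2", "term1 term2 term3"]
matches = filter_candidates(query, candidates)
```

With the example results above, only "term1 term2" and "term1 term2 term3" survive the filter, because "term1 term2 term4" contains a token that the query does not.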
Here's one way to do what you're asking for:
Field type:
<fieldType name="exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0" generateWordParts="0" catenateAll="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0" generateWordParts="1" catenateAll="0"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramsIfNoShingles="true" tokenSeparator="" maxShingleSize="99"/>
  </analyzer>
</fieldType>
Explanation:
The index analyzer uses WordDelimiterFilterFactory to split the field value into words. So, using your example, "a b" is split into the words "a" and "b", and "a b d" is split into "a", "b", and "d". We set catenateAll="1" and generateWordParts="0" so that the individual words are discarded, resulting in a single word: "a" and "b" become "ab", and "a", "b", and "d" become "abd".
The query analyzer is similar, with minor differences. We split the value into words, but we neither discard the words nor concatenate them. Instead, we pass the words to ShingleFilterFactory, which takes "a" and "b" and returns "a", "b", and "ab".
The reason we use shingles instead of concatenation is to allow "a b c" to match "a b" and "b c". If you want "a b c" to match only "a b c", set catenateAll="1" and remove the shingle factory.
Using this configuration, the query "a b" will match only "a", "b", and "a b" (not "a b d"). Also, "a b c" will match "a", "b", "c", "a b", "b c", and "a b c". It should also be noted that "ab" will match "a b". If any of this is not what you want, you should be able to configure the shingle and word filter factories to do exactly what you need.
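To make the matching behaviour easier to follow, here is a rough Python simulation of the two analyzers. It is a sketch, not Solr itself: the word splitting is approximated with a regex on non-alphanumeric characters, and the function names (words, index_token, query_tokens, matches) are invented for this illustration.

```python
import re


def words(text):
    """Rough stand-in for the word splitting: break on non-alphanumeric characters."""
    return [w for w in re.split(r"[^0-9A-Za-z]+", text) if w]


def index_token(value):
    """Index side: keep only the concatenation of all words (catenateAll="1")."""
    return "".join(words(value))


def query_tokens(query, max_shingle_size=99):
    """Query side: each word plus every run of adjacent words, joined with no separator."""
    w = words(query)
    out = set()
    for start in range(len(w)):
        last = min(len(w), start + max_shingle_size)
        for end in range(start + 1, last + 1):
            out.add("".join(w[start:end]))
    return out


def matches(query, value):
    """A document matches if its single index token is one of the query tokens."""
    token = index_token(value)
    return bool(token) and token in query_tokens(query)


# The two documents from the question, checked against the query "a b d".
hits = [v for v in ["a b", "a b c"] if matches("a b d", v)]
```

In this simulation the query "a b d" produces the tokens a, ab, abd, b, bd, and d, so it hits the document "a b" (indexed as ab) but not "a b c" (indexed as abc).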
EDIT: Previous versions of this answer added magic values to mark the start and end of the value. It turns out that is unnecessary; just concatenating the values together is enough to prevent "a b" from matching "a b d".
EDIT 2 (index analyzer fix): WhitespaceTokenizerFactory should have been KeywordTokenizerFactory. Also, the WordDelimiterFilterFactory should have catenateAll="0".