I have a situation where I need to use both EdgeNGramFilterFactory and NGramFilterFactory.
I am using NGramFilterFactory to perform a "contains"-style search with a minimum gram size of 2. I also want to match on the first letter alone, like a "startswith", using a front EdgeNGramFilterFactory.
I don't want to lower the NGramFilterFactory minimum to 1, because I don't want to index every single character.
Some help would be greatly appreciated.
Cheers
You don't necessarily have to do all this in the same field. I would create separate fields, each with its own custom type, so that you can apply each treatment independently.
In the following:
text contains the original tokens, minimally processed;
text_ngram uses the NGramFilter for your two-character-minimum tokens;
text_first_letter uses EdgeNGram for your one-character initial-letter tokens.
If you're processing all text fields in this way, then you might be able to get away with using a copyField to populate the fields. Otherwise, you can instruct your Solr client to send in the same field values for the three separate field types.
When searching, include all of them in your searches with the qf parameter.
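As a rough sketch of the qf part, the three fields could be set as defaults on a search handler in solrconfig.xml. This assumes the eDisMax query parser (qf is a DisMax/eDisMax parameter), a handler named /select, and fields named after the three field types; all of those are illustrative choices, not something your setup necessarily has.

```xml
<!-- Sketch only: handler name and field names are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- qf is only honored by the DisMax family of query parsers -->
    <str name="defType">edismax</str>
    <!-- Search all three fields; boosts such as text^3 can be added per field -->
    <str name="qf">text text_ngram text_first_letter</str>
  </lst>
</requestHandler>
```

The same value can also be passed per request as the qf parameter instead of being baked into the handler defaults.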
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- NGramFilterFactory has no "side" attribute; that belongs to EdgeNGramFilterFactory -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>
<fieldType name="text_first_letter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1" side="front"/>
  </analyzer>
</fieldType>
Setting up the field and dynamicField definitions is left up to you. Or let me know if you have more questions and I can edit with clarifications.
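For what it's worth, a minimal sketch of those definitions for the copyField route might look like the following. The field names simply reuse the type names, and the indexed/stored choices are assumptions; adjust them to your schema.

```xml
<!-- Sketch only: field names and stored/indexed settings are assumptions. -->
<field name="text" type="text" indexed="true" stored="true"/>
<field name="text_ngram" type="text_ngram" indexed="true" stored="false"/>
<field name="text_first_letter" type="text_first_letter" indexed="true" stored="false"/>

<!-- copyField copies the raw input value; each destination then runs its own analyzer -->
<copyField source="text" dest="text_ngram"/>
<copyField source="text" dest="text_first_letter"/>
```

With this in place the client only needs to send the text field, and the two derived fields are filled in at index time.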
Start by applying the EdgeNGramFilter with min = 1 and max = 1000 (large enough that the entire original token is included). Example:
hello => 'h', 'he', 'hel', 'hell', 'hello'
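Next, apply the NGramFilter with min = 2 (I will use 2 as the max in the example for simplicity). Note that the example assumes the single-character token 'h' passes through even though it is shorter than the minimum; check how your Solr version treats such tokens.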
'h', 'he', 'hel', 'hell', 'hello' => 'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo'
You will now have several identical tokens, because the NGramFilter was applied to every "partial" token produced by the EdgeNGramFilter. Simply apply the RemoveDuplicatesTokenFilter to remove them.
'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo' => 'h', 'he', 'el', 'll', 'lo'
Now your field will support a single-character "startsWith" query and a multi-character "contains" query.
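The chain above could be sketched as a single field type along these lines. The type name text_prefix_ngram is made up for illustration, and preserveOriginal="true" is an assumption about your Solr version: it is available on NGramFilterFactory in recent releases, while older versions silently drop tokens shorter than minGramSize, which would lose the single-character 'h'. The separate query analyzer (no n-gramming at query time) is also my own choice, not part of the steps above.

```xml
<!-- Sketch only: type name, maxGramSize of 15 and preserveOriginal are assumptions. -->
<fieldType name="text_prefix_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Step 1: hello => h, he, hel, hell, hello -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1000"/>
    <!-- Step 2: grams of 2+ characters; preserveOriginal keeps the too-short 'h' -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15" preserveOriginal="true"/>
    <!-- Step 3: drop the repeated grams produced by step 2 -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Query terms are matched as typed, so "h" or "ell" hit the indexed grams directly -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

It is worth running a sample value through the Analysis screen in the Solr admin UI to confirm that the tokens you expect, including the single first letter, actually come out of the index-time chain.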