In my analyzer chain, ShingleFilter comes after the stopword filter. As mentioned in the docs, ShingleFilter handles position increments > 1 by inserting filler tokens (tokens with term text "_").
For example: "please divide this sentence into biword shingles"
Shingles of size 2: please divide, divide _, _ sentence, sentence _, _ biword, biword shingles (assuming that "this" and "into" are stopwords)
I would like to eliminate those shingles with the filler tokens, i.e. my desired output contains only: please divide, biword shingles.
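To make the behavior concrete, here is a minimal Python sketch of the example above. It is a simplified model of what the docs describe, not Lucene's actual implementation; the function names are made up for illustration, and the stopword set is the one assumed in the example.

```python
FILLER = "_"  # default filler term text, as described above


def remove_stopwords(words, stopwords):
    """Return (term, position_increment) pairs, keeping the gaps left by stopwords."""
    tokens = []
    increment = 1
    for word in words:
        if word in stopwords:
            increment += 1  # the next kept token sits one position further on
            continue
        tokens.append((word, increment))
        increment = 1
    return tokens


def with_fillers(tokens):
    """Insert one filler term for every skipped position."""
    sequence = []
    for term, increment in tokens:
        sequence.extend([FILLER] * (increment - 1))
        sequence.append(term)
    return sequence


def shingles(sequence, size):
    """Sliding window of `size` consecutive terms."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def drop_filler_shingles(shingle_list):
    """Keep only shingles that contain no filler term (the desired output)."""
    return [s for s in shingle_list if FILLER not in s]


text = "please divide this sentence into biword shingles"
tokens = remove_stopwords(text.split(), {"this", "into"})
all_shingles = shingles(with_fillers(tokens), 2)
wanted = drop_filler_shingles(all_shingles)

print([" ".join(s) for s in all_shingles])
print([" ".join(s) for s in wanted])
```

The first list reproduces the six shingles listed above; the second keeps only "please divide" and "biword shingles".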
I've a dedicated field for facets with shingles up to 4-grams. Due to these stopwords, all the facet constraints (or values) look useless with those fillers like "divide _ sentence _"
Please could you guide me.
Using Solr 4.4.
UPDATE
I thought of setting enablePositionIncrements to false in the StopFilter configuration. I'm not sure whether that would solve the problem, and in any case Lucene 4.4 no longer supports it.
Add PatternReplaceFilterFactory to your analyzer chain after ShingleFilterFactory, and replace every token that contains the filler token with an empty string, i.e. "".
This may solve your problem temporarily, but for a permanent solution you would have to write your own analyzer or customize ShingleFilter.
Sample FieldType:
<fieldType name="text_general_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
  </analyzer>
</fieldType>
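As a rough illustration of what the pattern in this field type does to each term, here is a small Python sketch. It only models the regex replacement on the term text. It assumes the pattern replace step rewrites terms without removing tokens from the stream, so emptied terms would still be present unless something later drops them (a length-based filter might be one option). The helper names are hypothetical.

```python
import re

# Same pattern as in the sample field type: any term containing "_"
FILLER_PATTERN = re.compile(r".*_.*")


def pattern_replace(terms, pattern=FILLER_PATTERN, replacement=""):
    """Rewrite each term; the number of terms stays the same."""
    return [pattern.sub(replacement, term) for term in terms]


def drop_empty(terms):
    """Hypothetical follow-up step: discard terms that were emptied."""
    return [term for term in terms if term]


shingle_terms = [
    "please divide", "divide _", "_ sentence",
    "sentence _", "_ biword", "biword shingles",
]
replaced = pattern_replace(shingle_terms)
kept = drop_empty(replaced)

print(replaced)
print(kept)
```

Note that the pattern matches any term with an underscore in it, so in this model a term such as "foo_bar" would be emptied as well.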
PositionFilter should do the job. It is deprecated (see the Lucene documentation for why), but it should work.
...
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PositionFilterFactory" positionIncrement="1"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
Make sure you apply it at both query and index time, of course.
That said, are you sure you need this? Since the positionIncrements should be applied in similar ways at query and index time, having them will generally be helpful. Are you seeing particular problems when querying the index? Or just seeing strange things in debug output?
As of the Solr 4.7 release, you have the option to override the default filler token of "_". You could set it to an empty string. The configuration would look like this:
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" fillerToken=""/>