Question:

In my analyzer chain, ShingleFilter comes after the stopword filter. As mentioned in the docs, ShingleFilter handles position increments > 1 by inserting filler tokens (tokens with the term text "_").

For example: "please divide this sentence into biword shingles"

Shingles of size 2: please divide, divide _, _ sentence, sentence _, _ biword, biword shingles (assuming that "this" and "into" are stopwords)

I would like to eliminate those shingles with the filler tokens, i.e. my desired output contains only: please divide, biword shingles.

I have a dedicated field for facets with shingles up to 4-grams. Because of the stopwords, the facet constraints (values) are cluttered with fillers, such as "divide _ sentence _", and look useless.

Could you please guide me?

Using Solr 4.4.

UPDATE

I thought of setting enablePositionIncrements to false in the StopFilter configuration. I am not sure whether that would solve the problem, and in any case Lucene 4.4 doesn't support that setting anymore.

Answer 1:

Add PatternReplaceFilterFactory to your analyzer chain after ShingleFilterFactory, and replace every token that contains the filler token with the empty string, i.e. "".

This may solve your problem temporarily, but for a permanent solution you will have to write your own analyzer or customize ShingleFilter.

Sample FieldType:

<fieldType name="text_general_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
  </analyzer>
</fieldType>
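
One caveat with this approach: PatternReplaceFilterFactory rewrites the matching tokens to empty strings rather than removing them, so zero-length terms still come out of the chain. A minimal sketch of one way to discard them, assuming a LengthFilterFactory is appended as the last filter. The field type name and the max value of 255 are placeholders chosen for illustration and are not part of the original answer:

```xml
<!-- Sketch only: "text_shingle_no_filler" is a hypothetical field type name. -->
<fieldType name="text_shingle_no_filler" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    <!-- Blank out every shingle that contains the "_" filler -->
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
    <!-- Assumed addition: drop the resulting zero-length tokens; max="255" is an arbitrary upper bound -->
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
  </analyzer>
</fieldType>
```

Note that the pattern also blanks any term that itself contains an underscore, so this is only safe if such terms do not need to survive in the field.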

Answer 2:

PositionFilter should do the job. It is deprecated (see the Lucene documentation for why), but it should still work.

...
<filter class="solr.LowerCaseFilterFactory"/>           
<filter class="solr.PositionFilterFactory" positionIncrement="1"/>       
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>

Make sure you apply it at both query and index time, of course.

That said, are you sure you need this? Since the positionIncrements should be applied in similar ways at query and index time, having them will generally be helpful. Are you seeing particular problems when querying the index? Or just seeing strange things in debug output?
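
For context, the filter fragment in this answer could sit in a complete field type along the lines of the sketch below. It assumes the same tokenizer and stop filter as the sample in Answer 1, and the field type name is hypothetical. A single analyzer element without a type attribute is used for both indexing and querying, which is one way to satisfy the "both query and index time" requirement:

```xml
<!-- Sketch only: "text_shingle_flat_positions" is a hypothetical field type name. -->
<fieldType name="text_shingle_flat_positions" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain runs at index time and at query time -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Flatten the position gaps left by removed stopwords before shingling -->
    <filter class="solr.PositionFilterFactory" positionIncrement="1"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

With the gaps flattened, ShingleFilter would be expected to pair the words on either side of a removed stopword (for example "divide sentence"). That is not the same as dropping those shingles altogether, which is what the question asks for.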

Answer 3:

As of the Solr 4.7 release, you have the option to override the default filler token "_". You could set it to an empty string. The configuration would look like this:

<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" fillerToken=""/>

Lucene Analyzer chain: ShingleFilter without fille