Solr not tokenizing protected words

I have documents in Solr/Lucene (3.x) with a special copy field facet_headline in order to have an unstemmed field for faceting.

Sometimes two or more words belong together and should be handled and counted as one word, for example "kim jong il".

So the headline "Saturday: kim jong il had died" should be split into four tokens:

Saturday | kim jong il | had | died

For this reason I decided to use protected words (protwords.txt), to which I added kim jong il. The relevant part of schema.xml looks like this:

<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
            protected="protwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
  </analyzer>
</fieldType>

According to the Solr analysis page, this doesn't work: the string is still split into 6 words. It looks as if protwords.txt is not used at all. However, if the headline contains ONLY the name kim jong il, everything works fine and the terms aren't split.

Is there a way to achieve my goal of not splitting specific words or word groups?

Tags: solr lucene tokenize protected words

2 Answers

男人必须洒脱

#2 · 2019-07-04 06:10

After searching the web I came to the conclusion that it's not possible to achieve this goal. It looks like this is not what any of the tokenizers and filters are designed for.

Juvenile、少年°

#3 · 2019-07-04 06:11

Here's what I think is happening.

WordDelimiterFilterFactory is a token filter, so its job is to add, remove or change tokens that have already been generated (in this case, to split words into sub-words based on case transitions, hyphens, etc.). Splitting documents into words is the tokenizer's job (in this case, PatternTokenizerFactory). Your tokenizer pattern seems to be missing a \s, so it isn't splitting on whitespace and WordDelimiterFilterFactory receives whole phrases.

In your example, the pattern splits only at the colon, so WordDelimiterFilterFactory would receive a multi-word chunk such as kim jong il had died. As this chunk doesn't match any of your protected words, the filter proceeds to split this "word" into sub-words (whitespace is a non-alphanumeric character, so the word qualifies for splitting). When the headline contains only kim jong il, the whole token matches the protected entry and is left alone.

So here's a possible solution. Add a \s to your tokenizer pattern and then use KeywordMarkerFilterFactory to protect your words. Something like this:

<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s|\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"
            ignoreCase="false"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
  </analyzer>
</fieldType>

Update: OK, now that I have double-checked the documentation, this proposed solution is unlikely to work for you. I would focus on experimenting with SynonymFilterFactory instead. Check this message in the solr-user mailing list. It's a bit outdated, but gives some insight into the problem.
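
For what it's worth, here is a rough, untested sketch of what a SynonymFilterFactory experiment could look like. It assumes a synonyms.txt file with a one-way mapping that collapses the phrase into a single underscore-joined token; the file name and the underscore convention are my own choices for illustration, not something from the question. WordDelimiterFilterFactory is left out on purpose, because it would most likely split kim_jong_il again at the underscores.

```xml
<!-- Sketch only, not verified against a running Solr 3.x instance.
     Assumed content of synonyms.txt (one mapping per line):
         kim jong il => kim_jong_il
-->
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Same pattern as above, with \s added so that whitespace separates tokens -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s|\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Intended to replace the token sequence kim, jong, il with the single token kim_jong_il -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
  </analyzer>
</fieldType>
```

If this behaves as intended, the facet value would show up as kim_jong_il rather than kim jong il, so the underscores would have to be turned back into spaces when displaying the facet. Check the result on the analysis page before relying on it.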

