Solr not tokenizing protected words

Published 2019-07-04 05:54

Question:

I have documents in Solr/Lucene (3.x) with a special copy field facet_headline in order to have an unstemmed field for faceting.

Sometimes two or more words belong together and should be handled/counted as one word, for example "kim jong il".

So the headline "Saturday: kim jong il had died" should be split into:

Saturday
kim jong il
had
died

For this reason I decided to use protected words (protwords), to which I added kim jong il. The schema.xml looks like this:

   <fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
        <analyzer>
           <tokenizer class="solr.PatternTokenizerFactory" pattern="\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
           <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" 
                   protected="protwords.txt" />
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.TrimFilterFactory"/>
           <filter class="solr.StopFilterFactory"
           ignoreCase="true"
           words="stopwords.txt"
           enablePositionIncrements="true"
           />
        </analyzer>
   </fieldType>

Using the Solr analysis page, it looks like this doesn't work: the string is still split into six words, as if protwords.txt were not used at all. But if the headline contains ONLY the name kim jong il, everything works fine and the terms aren't split.

Is there a way to achieve my goal of not splitting specific words/word groups?

Answer 1:

Here's what I think is happening.

WordDelimiterFilterFactory is a token filter, so its job is to add, remove, or change already-generated tokens (in this case, to split words into sub-words based on case transitions, hyphens, etc.), not to split documents into words; that is the tokenizer's job (here, PatternTokenizerFactory). Your tokenizer pattern seems to be missing a \s, so it never splits on whitespace, and WordDelimiterFilterFactory receives whole phrases.

In your example, WordDelimiterFilterFactory would be getting the whole phrase Saturday kim jong il had died and, since it doesn't match any of your protected words, it proceeds to split this "word" into sub-words (a whitespace is a non-alphanumeric character, so the word qualifies for splitting).
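To illustrate, here is roughly what the token stream looks like at each stage (a reconstruction, not actual output from the analysis page):

   input text:                  Saturday kim jong il had died
   after PatternTokenizer:      [Saturday kim jong il had died]    (one big token: no \s in the pattern)
   after WordDelimiterFilter:   [Saturday] [kim] [jong] [il] [had] [died]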

So here's a possible solution. Add a \s to your tokenizer pattern and then use KeywordMarkerFilterFactory to protect your words. Something like this:

<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s|\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"
            ignoreCase="false"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
           ignoreCase="true"
           words="stopwords.txt"
           enablePositionIncrements="true"
           />
  </analyzer>
</fieldType>

Update: OK, now that I've double-checked the documentation, this proposed solution is likely not going to work for you: KeywordMarkerFilterFactory, like the protected-words list, operates on individual tokens, so a multi-word entry such as kim jong il can never match once the tokenizer has split the phrase. I would focus on experimenting with SynonymFilterFactory instead. Check this message in the solr-user mailing list. It's a bit outdated, but gives some insight into the problem.
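To give an idea of the synonym-based approach: at index time, SynonymFilterFactory can match a multi-token sequence and replace it with a single token, which then survives the rest of the chain as one facet value. A minimal, untested sketch (synonyms.txt and the target token kimjongil are placeholder names):

   <fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
      <analyzer>
         <!-- split on whitespace and punctuation, as proposed above -->
         <tokenizer class="solr.PatternTokenizerFactory" pattern="\s|\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
         <!-- collapse the protected phrase back into a single token -->
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true"
                 words="stopwords.txt"
                 enablePositionIncrements="true"/>
      </analyzer>
   </fieldType>

where synonyms.txt maps the three-token sequence to one token:

   # one rule per line; "=>" replaces the left-hand tokens with the right-hand token
   kim jong il => kimjongil

Note that the facet value would then show up as kimjongil rather than kim jong il, so you may need to map it back for display.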



Answer 2:

After searching the web, I came to the conclusion that it's not possible to reach this goal. It seems this is simply not what the tokenizers and filters are designed to do.