How to correctly configure Solr stemming

Posted 2019-07-21 19:05

Question:

I have configured a field in Solr as follows. When I search for the word "Conditioner", I was hoping to also find documents that contain "Conditioning". But according to Solr Analysis, the PorterStemFilter cuts "Conditioning" down to "Condit" at index time. At search time, my query for "Conditioner" is stemmed to "Condition", so it does not match "Conditioning".

How can I configure stemming so that both "Conditioner" and "Conditioning" stem to "condition"?

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" generateNumberParts="1" 
            catenateWords="1" catenateNumbers="1" catenateAll="0" 
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Answer 1:

I would also suggest trying a different stemmer. Solr includes four:

  1. solr.PorterStemFilterFactory
  2. solr.SnowballPorterFilterFactory
  3. solr.KStemFilterFactory
  4. solr.HunspellStemFilterFactory (this one needs a dictionary from an external source, such as OpenOffice)

Each of them produces different results for your problem, as listed below. Given these results, and because it needs no external resource, I would also opt for KStem. If you don't mind including a dictionary, I would go with Hunspell.

  1. porter
    • Conditioner -> condition
    • Conditioning -> condit
  2. snowballporter
    • Conditioner -> condition
    • Conditioning -> condit
  3. kstem
    • Conditioner -> condition
    • Conditioning -> condition
  4. hunspell with en_GB
    • Conditioner -> condition
    • Conditioning -> conditioning; condition
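
Based on the results above, switching to KStem might look like the fragment below. This is only a sketch against the text_general field type from the question: the tokenizer and the earlier filters are assumed to stay exactly as they are, and the same one-line change would be made in both the index and the query analyzer so that both sides produce the same stems.

```xml
<!-- Tail of the filter chain, identical in the index and query analyzers.
     Only the last line changes: KStem replaces solr.PorterStemFilterFactory. -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.KStemFilterFactory"/>
```

Because the index-time analysis changes, documents that are already indexed would need to be reindexed before the new stems take effect.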


Answer 2:

If only this particular case is important, you could override the stemmer:

StemmerOverrideFilterFactory

If the Porter stemmer is generally too aggressive, then try another stemmer like KStem.
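
A minimal sketch of the override approach, assuming a dictionary file called stemdict.txt (a hypothetical name) placed next to the schema in the conf directory. Each line of that file maps a word to the stem it should get, separated by a tab. The override filter goes before the stemmer and marks the tokens it rewrites as keywords, so the Porter stemmer leaves them alone.

```xml
<!-- Goes in BOTH the index and query analyzers, after lowercasing and
     before the stemmer. stemdict.txt is an assumed file name; its lines
     are tab-separated "word<TAB>stem" pairs, for example:
       conditioning<TAB>condition -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.StemmerOverrideFilterFactory"
        dictionary="stemdict.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
```

With this in place, "Conditioning" is mapped to "condition" by the dictionary, while "Conditioner" still goes through the Porter stemmer, which already yields "condition". Existing documents need to be reindexed after the change.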



Tags: solr, stemming