How to configure stemming in Solr?

2019-02-10 02:45发布

问题:

I add to solr index: "American". When I search by "America" there is no results.

How should schema.xml be configured to get results?

current configuration:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
                <filter class="solr.PorterStemFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
                <filter class="solr.PorterStemFilterFactory"/>
            </analyzer>
        </fieldType>

回答1:

Why would you have two stemmers?
Try removing EnglishPorterFilterFactory (deprecated) from both of your analyzer types, rebuild the index and then try whether search for American will yield America.

If that wont work, the other thing you can try is to remove both of your stemmer filters and add SnowballPorterFilterFactory with language="English" instead.



回答2:

You have to use one stemmer for an analyzer and EnglishPorterFilterFactory is deprecated as @Marko already mentioned. So you should remove this one from analyzers.

I used SnowballPorterFilterFactory for both index and query analyzers -

<fieldType name="text_stem">
    <analyzer> 
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
        <!-- other filters -->
    </analyzer>
</fieldType>

The fieldType definition is pretty self explanatory, but just in case:

  • Tokenizer solr.WhitespaceTokenizerFactory: This operation will break up the sentences into words, using whitespaces as delimiters.

  • Filter solr.SnowballPorterFilterFactory: This filter will apply a stemming algorithm to each word (token). In the example above I have chosen the Snowball Porter stemming algorithm. Solr provides a few implementation of popular stemming algorithms.

You can browse several other stemming algorithms e.g. HunspellStemFilterFactory, KStemFilterFactory too.



标签: solr stemming