What Solr tokenizer and filters can I use for a st

Posted 2020-07-21 09:01

Question:

I'd like to ensure that, say, I.B.M. can be found by searching for ibm. I'd also like to make sure that Dismemberment Plan can be found by searching for dismember.

Using Solr, what tokenizer and filters can I use at index and query time to permit both kinds of results?

Answer 1:

For I.B.M. => ibm
You need solr.WordDelimiterFilterFactory, which splits tokens on special characters and can catenate the resulting word and number parts.

catenateWords="1" catenates the word parts, turning I.B.M. into IBM.

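For Dismemberment => dismember
Include a stemming filter (e.g. solr.PorterStemFilterFactory or solr.EnglishMinimalStemFilterFactory), which indexes the roots of words so that words sharing a root match. The match depends on both terms reducing to the same stem, and stemmers vary in how aggressively they strip suffixes (EnglishMinimalStemFilterFactory, for instance, mainly handles plurals), so check the output for your terms on the Analysis screen of the Solr admin UI.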

In addition, you can use solr.LowerCaseFilterFactory for case-insensitive matches (IBM and ibm) and solr.ASCIIFoldingFilterFactory to fold accented characters to their ASCII equivalents.
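As a rough sketch, these two filters could sit in an analyzer chain as shown here. The whitespace tokenizer and the order of the filters are assumptions made for illustration, not requirements.

```xml
<!-- Sketch only: goes inside a <fieldType> element in the schema -->
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Lowercases every token, so IBM and ibm index to the same term -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Folds accented characters to ASCII, e.g. é to e -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
```

An analyzer declared without a type attribute applies at both index and query time.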

You can also use solr.SynonymFilterFactory to map words that you consider synonyms.
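A minimal sketch of a synonym filter entry might look like the following. The file name synonyms.txt (assumed to live in the core's conf directory) and the sample mapping are illustrative and not part of the original answer.

```xml
<!-- Sketch only: place inside an <analyzer> element, after the tokenizer -->
<!-- synonyms.txt is an assumed file name; expand="true" treats all listed terms as equivalent -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<!-- Hypothetical line in synonyms.txt: I.B.M., IBM -->
```

Placing this filter before the word delimiter filter lets it see the original token (I.B.M.) before it is split.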

You can apply these filters at both index and query time, so that terms are converted the same way on both sides and the results are consistent.

An example field type definition:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- Index time: split on delimiters and also emit the joined token, so I.B.M. yields I, B, M and IBM -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- Lowercase so that IBM and ibm match -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stemmer -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- Query time: same chain, but without catenation -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters