Solr Fuzzy Search for similar words

Posted 2019-03-09 16:46

Question:

I am trying to do a fuzzy search for "jahngir" ~ 0.2, which does not return any results. My index has records with the data "JAHANGIR RAHMAN MD". If I search with the exact word "jahangir" ~ 0.2, it works. Can someone please help me see what I am doing wrong? I have spent a lot of time trying to figure out how Solr fuzzy search works. Any links that explain Solr fuzzy search would be helpful. Below is the text field type that I am using for indexing. Thanks in advance.

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

Here is the configuration that worked for me after the response. Thanks!

<!-- Modified to fit fuzzy queries -->  
    <fieldType name="text_exact_fuzzy" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StandardFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Answer 1:

No, you do not need to enable stemming, and the use of a stemmer may be causing the problem.

You have far too many filters on the text field. Each word is converted to a Porter stem, which often is not a real word, and then the phonetic key of that stem is taken. The phonetic key is very different from the original word, so the surface word will rarely match the key stored in the index.

Use the analyzer page in the admin UI to see how terms are processed.

I recommend splitting the kinds of approximate match into different fields.

  • text_exact: lowercase, that's about it
  • text_stem: lowercase and stem
  • text_phonetic: lowercase and double metaphone, do not stem

Use fuzzy matching with text_exact, because it handles typing errors. Do not use fuzzy against the other fields.
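
As a rough sketch of the split described above, the schema might look something like the following. The field names (`name_exact`, `name_stem`, `name_phonetic`) and the `copyField` wiring are assumptions made for illustration and are not part of the original configuration; the tokenizer and filter classes are ones already used in the question.

```xml
<!-- Sketch only: type and field names are illustrative assumptions. -->

<!-- Exact: lowercase and nothing else. Fuzzy queries go against this one. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Stemmed: lowercase, then Porter stem. -->
<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Phonetic: lowercase, then Double Metaphone, with no stemming. -->
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

<!-- One source field copied into the two other analysis variants. -->
<field name="name_exact" type="text_exact" indexed="true" stored="true"/>
<field name="name_stem" type="text_stem" indexed="true" stored="false"/>
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="name_exact" dest="name_stem"/>
<copyField source="name_exact" dest="name_phonetic"/>
```

With a layout like this, a fuzzy query such as `name_exact:jahngir~1` (on Solr 4 or later, where the number after `~` is a maximum edit distance) would target only the lightly analyzed field. "jahngir" is one inserted letter away from "jahangir", so an edit distance of 1 would be enough for this case.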

You can weight these fields differently. The exact match is a higher-quality match than the rest, so it can have a bigger weight. The stemmed match is better than the phonetic one, so its weight should be smaller than exact but bigger than phonetic.
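
One possible way to express that ordering is with per-field boosts in the `qf` parameter of the eDisMax query parser. The snippet below is a minimal sketch for `solrconfig.xml`; the field names and the boost values (4, 2, 1) are assumptions chosen only to show exact > stemmed > phonetic, and would need tuning against real data.

```xml
<!-- Sketch only: field names and boost values are illustrative assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the Extended DisMax parser so that qf boosts apply. -->
    <str name="defType">edismax</str>
    <!-- Exact matches weigh most, stemmed next, phonetic least. -->
    <str name="qf">name_exact^4 name_stem^2 name_phonetic^1</str>
  </lst>
</requestHandler>
```

The same `qf` string can also be passed as a request parameter instead of being set as a handler default, which makes it easier to experiment with different weights.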



Answer 2:

In order to get Fuzzy Searches to work, you will need to enable the correct Stemming and/or Filter Factory for your desired language. Please see the Language Analysis topic on the Solr Wiki for more details.

Edit: Please see Analyzers, Tokenizers and Token Filters for more details on the different ways of indexing your data and how this impacts the search of your data.