DoubleMetaphoneFilterFactory in Solr

Posted 2019-07-07 22:20

Question:

My goal is to integrate Solr so that the results returned by my application are accurate and fast. I am searching over a name field using Double Metaphone so that similar-sounding names are also captured, and then using fuzzy search (which uses the Levenshtein distance algorithm) to fetch only the results above a certain similarity percentage. The problem is that once I put the Double Metaphone filter on the name field type, I am unable to perform fuzzy search over that field.

The example configuration from my schema.xml looks like:

<field name="sdn_names" type="doublemetaphonetic" indexed="true" stored="true" termVectors="true"/>
<!-- Definition of the doublemetaphonetic field type. -->
<fieldtype name="doublemetaphonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>

From the Solr UI, when I search for sdn_names:abdul~0.50, it returns 0 results; if I change my query string to sdn_names:abdul, I get 180 records in the result set. I searched for a solution and found that when Double Metaphone is used for indexing, the phonetic value differs from the original value, so the Levenshtein distance calculated between the two strings is very large and there are no results. Please point me to any links or recommended solutions/reading for this problem, as I am new to Solr. Thanks in advance.

Answer 1:

Metaphone and Wildcards are just not compatible.

Firstly, Lucene does not analyze terms with wildcards, fuzzy matching, regex, etc. As such, you are trying to search plain text against metaphone codes. So, you have:

  • In index: APTL
  • In query: abdul~0.5

Which I think makes it more obvious why you don't get any matches. Even ignoring case, that's a Levenshtein distance of 3 (two substitutions and one deletion), which is considerable for a five-letter term.

Mixing metaphone with wildcards doesn't make a great deal of sense. A valid metaphone match should be an exact match. The metaphone algorithm reduces the term to a code representing its first four sounds (simplifying somewhat).

These are two different and separate methods of searching for relevant looser results. They should be kept separate, so if you want to be able to search on both fuzzy matching and metaphone, the best idea would be to index the metaphones and full text in two different fields, and then search on both of them. Something like:

<field name="sdn_names_phonetic" type="doublemetaphonetic" indexed="true" stored="false" termVectors="true"/>
<field name="sdn_names" type="text_standard" indexed="true" stored="true" termVectors="true"/>

<fieldType name="text_standard" class="solr.TextField"> 
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/> 
</fieldType> 
<fieldtype name="doublemetaphonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>

(Note: I've changed your metaphone field to stored=false; since both of these fields would store the same data, there is no need to store both of them.)

Which could be searched like:

sdn_names:abdul~0.5 sdn_names_phonetic:abdul

See the Solr documentation section "Indexing same data in multiple fields" for a bit more about this sort of pattern.
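
As a rough sketch of that pattern: rather than sending the name twice when indexing, Solr's copyField directive can fill the phonetic field from the plain one. The field names below are the ones used in the schema above. The assumption (not stated in the question) is that documents are indexed with only sdn_names populated:

```xml
<!-- Assumption: the client only sends sdn_names at index time.
     copyField copies the raw input value (before analysis) into the
     destination field, which then applies its own analyzer, so
     sdn_names_phonetic ends up holding the Double Metaphone codes. -->
<copyField source="sdn_names" dest="sdn_names_phonetic"/>
```

With this in schema.xml (and after reindexing), the fuzzy clause runs against the plain terms in sdn_names while the exact clause runs against the phonetic codes in sdn_names_phonetic.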