Solr - Russian synonyms are not working

Posted 2019-08-01 06:27

Question:

I am running Solr v4.8.0 on Ubuntu 12.04 LTS.

I have a field type in schema.xml that uses the solr.SynonymFilterFactory filter:

    <fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball" />
        <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
      </analyzer>
    </fieldType>

I have the following synonym mapping:

spidermen, superman, batman, бетмен, бетмэн, спайдермен, спайдермэн, супермен, супермэн, spiderman

I checked the encoding of the synonyms.txt file and it is UTF-8.

Queries with English synonyms work fine. The problem is only with the Russian synonyms: they do not work, and Solr ignores them. I have not been able to solve this myself.

Added 30 minutes later: somehow the words "бетмэн" and "спайдермэн" are found in search results, but "бетмен" and "спайдермен" are not.

Answer 1:

Try swapping the order of the synonym and Porter stemmer filters. As configured, the synonym file is consulted after the stemmer has already chopped off the word endings, so the stemmed tokens probably no longer match the entries in the file.
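
A minimal sketch of the reordered analyzer, assuming everything else in the field type from the question stays unchanged. The only difference is that the synonym filter now sits above the stemmer, so synonyms.txt is matched against unstemmed tokens and the expanded synonyms are stemmed afterwards along with the rest of the text. This is untested against the asker's setup, and existing documents would presumably need to be reindexed before the change shows up in search results.

    <fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball" />
        <!-- Moved up: synonyms are looked up before any endings are stripped -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <!-- Stemming now runs last, over the original token and its expanded synonyms -->
        <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
      </analyzer>
    </fieldType>

Because the same analyzer is applied at index and query time here, both sides should end up with the same stemmed forms of every synonym.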

The Analysis screen in the admin Web UI is a great tool to see what happens with the text as it goes through individual filters.



Answer 2:

I've just written a small test for this case and found that stemming causes the issue. When I disable it, everything works smoothly; swapping it with the synonym filter helps as well.

Link to the test: https://github.com/MysterionRise/information-retrieval-adventure/blob/master/lucene5/src/main/scala/org/mystic/SynonymsAndStopwords.scala



Tags: solr