Non-English Language support via SolrNet

Posted 2019-06-02 04:22

Question:

I am using SolrNet to search Solr from a .NET application. Everything works fine when I search for English words. However, if I use Spanish words like español, I get no search results even though I have indexed them. When I debugged the query in Solr, I found that it was parsed as espaA+ol.

Do I have to do some UTF-8 encoding, or does SolrNet support searching over ASCII characters only?

Answer 1:

This is not a SolrNet issue; it is related to how Solr handles characters outside the 7-bit ASCII range (code points above 127). The best recommendation is to add the ASCIIFoldingFilterFactory to the field type of the Solr field where you are storing the Spanish words.

As an example, suppose you are using the text_general fieldType as defined in the Solr example, which is set up as follows in the schema.xml file:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I would recommend modifying it as follows, adding the ASCIIFoldingFilterFactory as the last filter in both the index and query analyzers:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

Also, please note that you will need to reindex your data after making this schema change for the changes to be reflected in the index.
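
For the modified analyzers to apply, the field that stores the Spanish text has to use this field type. A minimal sketch of such a declaration follows; the field name description_es is hypothetical and is not taken from the question, so substitute the field you actually index.

```xml
<!-- Hypothetical field name; only the type attribute matters here.
     It must point at the fieldType that contains ASCIIFoldingFilterFactory. -->
<field name="description_es" type="text_general" indexed="true" stored="true"/>
```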



Answer 2:

Do you specifically need to keep those characters in the index? If you don't, it would be better to use something like

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

so that 'español' is indexed as 'espanol', and searching for either form finds 'español' (the same applies to á, ü, etc.).
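
As a sketch of where this line could go, the fragment below adds the charFilter to the text_general fieldType from Answer 1. It assumes the mapping-ISOLatin1Accent.txt file is present in the core's conf directory (it ships with the Solr example configuration). A charFilter operates on the raw character stream, so it is declared before the tokenizer, and it is added to both analyzers so that indexed text and queries are normalized the same way.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- char filters run before tokenization, so they come first -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- same mapping at query time, so 'español' and 'espanol' produce the same term -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As with any analyzer change, existing documents need to be reindexed before the mapped terms appear in the index.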