How to use properly the copy fields on solr for au

Posted 2019-06-11 11:06

Question:

I want to use the "autocomplete" for a search engine on my site.

So, I have a field called shortdesc with the following definition:

<field name="shortdesc" type="text_de" indexed="true" stored="false" />

The field type:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index"> 
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LengthFilterFactory" min="3" max="20"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
        <filter class="solr.GermanNormalizationFilterFactory"/>
   </analyzer>
</fieldType>

So now, to do the autocomplete, I need an extra field (field_autocomplete) into which I'm going to copy the field shortdesc. This field is defined as follows (I don't need to retrieve data from this field):

<field name="field_autocomplete" type="text_autocomplete" indexed="true" stored="false" multiValued="true" />

And the type definition:

<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
     <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

And then, to copy the field:

    <copyField source="shortdesc" dest="field_autocomplete"/>

OK, so my first question:

  • When indexing, all the content of field_autocomplete comes from the copy of shortdesc. Does that mean a value in shortdesc is processed first and then copied to field_autocomplete? In that case, I wouldn't need to apply the filters on the type text_autocomplete that are the same as in text_de, because the source would arrive with those filters already applied. Is this right, or do I have to specify the filters on every type (for each field I want "to capture")?

And another question:

  • When I use the analyser and enter a word that is in the stopword list, on the type text_de the filter is applied and the word doesn't appear. But when I do the same on the type text_autocomplete, the word is still there and is kept as a term; the filter doesn't seem to do anything...

Can anybody give me a clue about these two things that are driving me crazy?

Answer 1:

  • You need to define all the filters again. None of the source field's analysis is carried over to the destination field.

From the documentation for copyField:

The original text is sent from the "source" field to the "dest" field, before any configured analyzers for the originating or destination field are invoked.
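
To make the quoted behaviour concrete, here is the schema from the question again, with comments tracing what each field receives at index time. The sample value in the first comment is a made-up illustration, not data from the question.

```xml
<!-- Hypothetical input document: shortdesc = "Der schnelle Zug" -->

<!-- Receives the raw string and analyzes it with the text_de index chain. -->
<field name="shortdesc" type="text_de" indexed="true" stored="false" />

<!-- Receives the same raw string (not the tokens produced by text_de)
     and analyzes it with the text_autocomplete index chain. -->
<field name="field_autocomplete" type="text_autocomplete" indexed="true" stored="false" multiValued="true" />

<copyField source="shortdesc" dest="field_autocomplete"/>
```

Because the copy happens before analysis, any filter that text_autocomplete should apply has to be listed in text_autocomplete itself.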

  • The German stop filter in text_autocomplete is missing format="snowball", which is likely what makes the difference in the analysis.
    Also, it is usually recommended to use the same tokenizer and filters at both index and query time so that the indexed terms match the searched terms, so you may want to check the configuration again.
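
As a rough sketch of both suggestions applied to the type from the question (not a tested configuration): format="snowball" is added to the German stop filter only, on the assumption that lang/stopwords_en.txt is a plain one-word-per-line file as in Solr's stock configsets. The query analyzer is aligned with the index analyzer except for the edge n-gram step, which is assumed to be wanted at index time only so that a typed prefix matches the stored grams.

```xml
<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- format="snowball" added so the German stopword file is parsed correctly -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <!-- Assumption: the English file is in the plain format, so no format attribute here -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <!-- Prefix grams are produced at index time only -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <!-- Same tokenizer and filter order as the index chain, minus the n-gram filter -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>
```

After changing the type, the affected documents have to be reindexed before the new index-time analysis takes effect.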