How to define a field type for a field that contains both Chinese and English

Posted 2019-09-03 19:34

Question:

I am using Solr to index a field that contains both Chinese and English text. I also need to use the NGramTokenizerFactory tokenizer for searching.

Below is the current field type I defined for the field:

<fieldType name="text_general2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I have to set minGramSize="1" so that a single Chinese character can be searched. However, this is completely unsuitable for searching English words.

For example, if I search for "see", it returns matches for "s", "se", "ee", "see", and "e".

Could anyone please tell me the best way to index a field that contains both Chinese and English?

Answer 1:

I'm sure this isn't the answer you were hoping for, but it's the one that will actually solve the problem: don't use a single field to hold both Chinese and English.

Have one field for English and one for Chinese, and index content into the field that matches its language. If you don't know the language at index time, you can use the Language Detection feature in an update processor to let Solr decide which field to put the content into.
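
As a rough sketch of the language detection step, an update processor chain in solrconfig.xml might look something like the one below. The chain name "langid", the input field "content", and the "language_s" field are assumptions made for illustration, and it presumes that content_en and content_zh are declared in the schema and that the langid contrib is on the classpath. The exact codes the detector reports for Chinese can differ between versions, so both spellings are listed here as a guess to verify against your own Solr.

```xml
<!-- solrconfig.xml (sketch): "langid", "content" and "language_s" are hypothetical names -->
<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- field to inspect when detecting the language -->
    <str name="langid.fl">content</str>
    <!-- field that receives the detected language code -->
    <str name="langid.langField">language_s</str>
    <!-- accept only English and Chinese; everything else uses the fallback -->
    <str name="langid.whitelist">en,zh-cn,zh-tw,zh_cn,zh_tw</str>
    <str name="langid.fallback">en</str>
    <!-- fold the regional Chinese codes into a single "zh" code (assumed spellings) -->
    <str name="langid.lcmap">zh-cn:zh zh-tw:zh zh_cn:zh zh_tw:zh</str>
    <!-- rename "content" to content_<lang>, i.e. content_en or content_zh -->
    <bool name="langid.map">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then be selected on update requests with update.chain=langid, or set as the default chain of the update handler.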

Searching is then done across both fields (depending on your query handler, possibly using qf), so that tokens are processed separately for each language against its own field (and English words don't get n-grammed).
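
To make the per-field processing concrete, here is a minimal schema sketch. The type names text_en_words and text_zh_ngram and the field names content_en and content_zh are made up for this example; the Chinese type simply reuses the n-gram settings from the question, while the English type tokenizes on word boundaries instead.

```xml
<!-- schema (sketch): all type and field names below are hypothetical -->

<!-- English: whole-word tokens, no n-grams -->
<fieldType name="text_en_words" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- Chinese: n-grams starting at a single character, as in the question -->
<fieldType name="text_zh_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
</fieldType>

<field name="content_en" type="text_en_words" indexed="true" stored="true"/>
<field name="content_zh" type="text_zh_ngram" indexed="true" stored="true"/>
```

With the edismax query parser, a request could then search both fields with parameters along the lines of defType=edismax and qf=content_en content_zh, so a query for "see" is matched as a whole word against content_en and only n-grammed against content_zh.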

If you have both English and Chinese in the same document, preprocess the document to separate the Chinese and English parts (for example, iterate over each paragraph and detect its language) before indexing them to different fields.



Tags: solr