I am using Solr to index a field that will contain both Chinese and English text, and I need to use the NGramTokenizerFactory tokenizer for searching.
Below is the field type I currently have defined for this field:
<fieldType name="text_general2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I have to set minGramSize="1" to allow searching by a single Chinese character. However, this is completely unsuitable for searching English words: for example, if I search for "see", the query is tokenized into "s", "se", "ee", "see", and "e", which matches far too many documents.
Therefore, could anyone please tell me the best way to index a field that contains both Chinese and English?
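
For reference, this is the kind of alternative I have been considering (an untested sketch; the field type name text_cjk_en is just a placeholder). It uses solr.StandardTokenizerFactory, which keeps English words whole while emitting each CJK character as its own token, and solr.CJKBigramFilterFactory to join adjacent CJK characters into bigrams, with outputUnigrams="true" so the single characters are kept and a one-character Chinese query can still match:

<fieldType name="text_cjk_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keeps English words intact; emits each CJK character as a separate token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Joins adjacent CJK characters into bigrams; outputUnigrams="true" also
         keeps the single characters so single-character Chinese search works -->
    <filter class="solr.CJKBigramFilterFactory" han="true" outputUnigrams="true"/>
  </analyzer>
</fieldType>

Would something along these lines be a reasonable starting point, or is there a better-established approach?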