Solr index and search multilingual data

Posted 2019-05-19 07:31

Question:

In my Solr setup, Solr detects the language of the data during indexing and applies different indexing rules according to the detected language. All data is stored in language-specific fields, for example:

  • English titles are stored in the title_en field.
  • Spanish titles are stored in the title_es field.


<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_es" type="text_es" indexed="true" stored="true"/>

All searches are made against one catch-all field, "text":

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

All language-specific fields are copied to the "text" field so that they are available to search queries:

<copyField source="title_en" dest="text"/>
<copyField source="title_es" dest="text"/>

My concern is this: the "text" field is analyzed on its own, presumably with the "text_general" rules, so the content is effectively re-indexed there, and I suspect the language-specific analysis applied to title_en and title_es is lost.

If so, how can I search across all data in a single query while still using the language-specific indexes?

Answer 1:

Yes. The content indexed in text (defined as text_general) is processed only according to the rules for that field type; the analysis configured for title_en or title_es has no effect on it. copyField copies the original value before any analysis takes place, since you usually (as in this case) want to apply different tokenization and analysis to the destination field.

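An easy solution is to query title_en and title_es directly when you want to search both, using the qf (query fields) parameter of the dismax or edismax query parser: defType=edismax&qf=title_en title_es. Note that the fields in qf are separated by spaces, not commas. The query text is then analyzed separately for each field with that field's own rules, so a single query searches both the English and the Spanish versions of your content.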
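As a rough illustration, the request for such a query could be assembled like this. This is only a sketch: the host, port, and core name (localhost:8983, mycore) and the helper name build_multilingual_query are assumptions made for the example, not something taken from the question.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_multilingual_query(base_url, user_query, fields=("title_en", "title_es")):
    """Build a Solr /select URL that searches several language-specific fields.

    base_url is a hypothetical core URL, e.g. http://localhost:8983/solr/mycore.
    """
    params = {
        "q": user_query,
        # qf is only honored by the dismax/edismax query parsers.
        "defType": "edismax",
        # qf takes a space-separated list of fields (not comma-separated).
        "qf": " ".join(fields),
        "wt": "json",
    }
    # urlencode percent-encodes the values; spaces become '+'.
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name, used only for illustration.
url = build_multilingual_query("http://localhost:8983/solr/mycore", "casa blanca")
print(url)
```

If matches in one language should count more than the other, qf also accepts per-field boosts such as title_en^2 title_es.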



Tags: solr indexing