Using Solr for indexing multiple languages

Posted 2020-06-03 01:56

We're setting up Solr to index documents whose title field can be in various languages. After some searching I found two options:

  1. Define a separate schema field for each language (title_en, title_fr, ...), apply language-specific filters to each field, and query the title field that matches the language of the query.
  2. Create a separate Solr core for each language and have our app query the right core.

Which one is better? What are the upsides and downsides?

Thanks

Tags: java lucene solr
3 answers
叼着烟拽天下
#2 · 2020-06-03 02:03

There's also a third alternative: use a common set of fields for all languages and filter on a language field. For instance, if you have the fields text and language, you can put the text content for all languages into the text field and use e.g. fq=language:english to retrieve only English documents.

The downside of this approach is that you cannot use language-specific features such as lemmatisation, stemming, etc., because a single field has a single analysis chain.
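As a rough illustration of the shared-field idea, here is a minimal Python sketch that only builds the request parameters. The field names text and language come from the example above; the function name and the sample query are made up for illustration, and the user query is passed through unescaped to keep the sketch short.

```python
from urllib.parse import urlencode


def build_filtered_params(user_query, language):
    """Search the shared text field and restrict results by language.

    Hypothetical helper: the query string is NOT escaped here, which
    real code should do before sending it to Solr.
    """
    return {
        "q": f"text:({user_query})",   # one shared field for all languages
        "fq": f"language:{language}",  # filter query selecting the language
    }


params = build_filtered_params("solar power", "english")
print(urlencode(params))  # query string to append to a /select request
```

Since the language restriction lives in fq rather than in q, it does not take part in relevance scoring, and Solr can cache the filter separately from the main query.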

"Define different schema fields for every language i.e. title_en, title_fr,... applying different filters to each language then query one of title fields with a corresponding language."

This approach gives good flexibility, but beware of high memory consumption and added complexity when many languages are present. This can be mitigated by using multiple Solr servers.
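On the application side, the per-language field approach mostly comes down to picking the right field name at index and query time. A minimal sketch, assuming the schema already defines title_en, title_fr, etc. with suitable analyzers; the language set and helper names below are hypothetical:

```python
SUPPORTED_LANGUAGES = {"en", "fr", "de"}  # hypothetical set of languages


def title_field_for(language_code):
    """Map a language code to its language-specific title field."""
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"no title field configured for {language_code!r}")
    return f"title_{language_code}"


def build_document(doc_id, title, language_code):
    """Document to index: the title goes only into its own language field."""
    return {"id": doc_id, title_field_for(language_code): title}


def build_query_params(user_query, language_code):
    """Query only the title field matching the language (query unescaped)."""
    return {"q": f"{title_field_for(language_code)}:({user_query})"}


print(build_document("42", "La maison bleue", "fr"))
print(build_query_params("maison", "fr"))
```

Every new language then means one more field in the schema plus one more entry in the set, which is where the complexity mentioned above comes from.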

"Creating different Solr cores to handle each language and make our app query correct Solr core."

Definitely a nice solution. But whether the separate administration and slight overhead are acceptable probably depends on the number of languages you wish to support.

Unless the first approach is applicable, I would probably lean towards the second one, provided the scalability of cores is something you want. Either approach is fine, though, and I think it basically comes down to preference.
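With one core per language, the app only needs a lookup from language to core before it builds the request URL. A minimal sketch; the base URL, the core names docs_en and docs_fr, and the single title field are all assumptions for illustration:

```python
from urllib.parse import urlencode

SOLR_BASE_URL = "http://localhost:8983/solr"           # assumed local Solr
CORE_BY_LANGUAGE = {"en": "docs_en", "fr": "docs_fr"}  # hypothetical cores


def select_url(user_query, language_code):
    """Build the /select URL of the core that holds the given language."""
    core = CORE_BY_LANGUAGE[language_code]  # KeyError if unsupported
    query_string = urlencode({"q": f"title:({user_query})"})
    return f"{SOLR_BASE_URL}/{core}/select?{query_string}"


print(select_url("maison", "fr"))
```

Each core can keep the same field name (title here) with a different analyzer in its own schema, so the query itself stays identical across languages and only the URL changes.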

戒情不戒烟
#3 · 2020-06-03 02:04
  • If you use multiple cores and you need sharding, one issue I can see is that you will have to shard each language (core) separately. You won't be able to shard the whole index at once.

  • If you use a single core, you may waste space on text fields (columns) that are "not full", i.e. empty for most documents. I'm not sure about that.
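To make the first point concrete: with Solr's classic distributed search, a request lists its shards in the shards parameter, so each language core needs its own list. A minimal sketch; the host names and the core name are hypothetical, and it assumes every host carries one shard of each language core under the same name:

```python
SHARD_HOSTS = ["solr1:8983", "solr2:8983"]  # hypothetical shard hosts


def shards_param(core_name, hosts=SHARD_HOSTS):
    """Comma-separated shard list for one language core."""
    return ",".join(f"{host}/solr/{core_name}" for host in hosts)


# One shard list per language core, rather than one for the whole index.
print(shards_param("docs_fr"))
```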
SAY GOODBYE
#4 · 2020-06-03 02:08

It all depends on your requirements. I am assuming you don't need to query multiple languages in a single query. In that case splitting them into multiple cores would be the better idea, since you can tune one core without affecting the other cores and their indexes. With multiple languages there will always be some tuning involved for stemming, spell checking and other features (if you plan to use them).

There is also the option of running multiple Solr webapps within one servlet container, so that is something else you can look at.

It also depends on how much flexibility you have regarding the downtime you can take to fix any issues.
