I am building a multilingual search and will use Lucene for it.
I already have the translated content; each document will exist in 3 or 4 languages.
For indexing and search, I see four possible strategies for each document:
- Each language is indexed in a separate index/directory.
- Each language is indexed as a separate document, all in the same index.
- Each language is indexed in a separate field of the same document.
- All languages are indexed in the same field of one document.
I have not tested any of these approaches yet. Could anyone with experience tell me which one works better for multilingual search?
Thanks!
Although the question was asked a couple of years ago, it's still a great question.
There are a couple of aspects to consider when evaluating the different solution approaches:
If (1.) & (5.) are valid in your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, because the term frequencies of the various languages all get mixed together (regardless of whether you index your multilingual content as one document or as multiple documents). It may be worth knowing that adding "n" language-specific fields does not result in an "n"-times larger index, although it does come with some overhead.
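To make the point about mixed-up term statistics concrete, here is a minimal sketch of a toy inverted index in plain Java. It does not use Lucene's API; the class `FieldStatsDemo`, the field names `content`, `content_en` and `content_de`, and the sample sentences are all made up for illustration. It only counts in how many documents a term occurs per field, which is the kind of statistic an IDF-based scorer relies on.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index (illustration only, NOT Lucene's API):
// maps "field:term" to the set of document ids that contain the term.
class FieldStatsDemo {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index the lower-cased, whitespace-separated terms of `text` under `field`.
    void add(int docId, String field, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            postings.computeIfAbsent(field + ":" + term, k -> new HashSet<>()).add(docId);
        }
    }

    // Document frequency: number of documents containing `term` in `field`.
    int docFreq(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet()).size();
    }

    // Build the same four sample documents either into one shared field
    // or into one field per language (hypothetical field names).
    static FieldStatsDemo build(boolean sharedField) {
        String[][] docs = {
            {"en", "the plant will die soon"},
            {"de", "die Katze trinkt Milch"},
            {"de", "die Sonne scheint"},
            {"de", "die Stadt ist alt"},
        };
        FieldStatsDemo index = new FieldStatsDemo();
        for (int i = 0; i < docs.length; i++) {
            String field = sharedField ? "content" : "content_" + docs[i][0];
            index.add(i, field, docs[i][1]);
        }
        return index;
    }

    public static void main(String[] args) {
        FieldStatsDemo shared = build(true);
        FieldStatsDemo perLanguage = build(false);
        System.out.println(shared.docFreq("content", "die"));         // 4
        System.out.println(perLanguage.docFreq("content_en", "die")); // 1
        System.out.println(perLanguage.docFreq("content_de", "die")); // 3
    }
}
```

In the shared field the English verb "die" appears to occur in every document, because it collides with the German article, so a frequency-based ranking would treat it as almost meaningless. With one field per language it stays a rare term in the English field.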
Single Field (Strategies 2 & 4)
Multiple Fields (Strategy 3)
Multiple Indices (Strategy 1)
Regardless of whether you take a single-field or a multiple-fields approach, your solution might need to handle result collapsing for matches in the "wrong" language if you index your content as multiple documents. One approach could be to add a language field and filter on it.
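The filtering and collapsing idea can be sketched in plain Java, independent of Lucene. Everything here is an assumption made for illustration: the `Hit` class, its `sourceId` field (an id shared by all translations of one original document), and the `language` field are hypothetical and would correspond to stored fields you add yourself at index time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LanguageCollapseDemo {
    // One search hit. `sourceId` and `language` are hypothetical stored fields.
    static final class Hit {
        final String sourceId;
        final String language;
        final double score;

        Hit(String sourceId, String language, double score) {
            this.sourceId = sourceId;
            this.language = language;
            this.score = score;
        }
    }

    // Option A: keep only hits whose language field matches the requested one.
    static List<Hit> filterByLanguage(List<Hit> hits, String language) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            if (h.language.equals(language)) {
                out.add(h);
            }
        }
        return out;
    }

    // Option B: collapse translations, keeping the best-scoring hit per
    // original document, then sort by descending score.
    static List<Hit> collapse(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.sourceId, h, (a, b) -> a.score >= b.score ? a : b);
        }
        List<Hit> out = new ArrayList<>(best.values());
        out.sort((a, b) -> Double.compare(b.score, a.score));
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("doc1", "en", 1.2));
        hits.add(new Hit("doc1", "de", 2.0));
        hits.add(new Hit("doc2", "en", 0.7));
        for (Hit h : collapse(hits)) {
            System.out.println(h.sourceId + " " + h.language + " " + h.score);
        }
    }
}
```

Filtering is the simpler choice when the user has picked a language up front; collapsing is useful when any language may match but each original document should appear only once in the result list.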
Recommendation: The approach/strategy you choose depends on the project's requirements. Whenever possible, I would opt for a multiple fields or multiple indices approach.
In short, it depends on your needs, but I would go with option 3 or 1.
1) would probably be the best way if there is no overlap / no shared fields between the languages at all.
3) would be the way to go if there are several fields that need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache.
I would not recommend 2): this makes your search queries more complex and forces Lucene to consider more documents.
4) will make your search query very complex, unless you want users to be able to search in any language without selecting it first.