Indexing multi-lingual content with Lucene.net

Posted 2019-05-07 02:58

I use Lucene.net to index website content, documents and so on. The index is very simple and has this format:

LuceneId - unique id for Lucene (TypeId + ItemId)
TypeId   - the type of text (e.g. page content, product, public doc)
ItemId   - the web page id, document id, etc.
Text     - the text indexed
Title    - web page title, document name, etc., displayed with the search results

I've got these options to adapt it to serve multi-lingual content:

1. Create a separate index for each language, e.g. Lucene-enGB, Lucene-frFR, etc.
2. Keep the one index and add an additional 'language' field to it to filter the results.

Which is the better option, or is there another? I've not used multiple indexes before, so I'm leaning toward the second.
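For readers weighing the second option, here is a minimal sketch of the idea in plain Python. It is a toy in-memory index, not the Lucene.net API: the ToyIndex class, its method names and the "Language" field name are all made up for illustration. It also assumes the language code is appended to LuceneId, since TypeId + ItemId alone would collide once the same page is indexed in two languages.

```python
from collections import defaultdict


class ToyIndex:
    """Toy in-memory inverted index (hypothetical, NOT the Lucene.net API)."""

    def __init__(self):
        self.docs = {}                    # LuceneId -> stored fields
        self.postings = defaultdict(set)  # term -> set of LuceneIds

    def add(self, type_id, item_id, language, title, text):
        # Assumption: the language is part of the unique id, so that the
        # English and French versions of one page do not overwrite each other.
        lucene_id = f"{type_id}-{item_id}-{language}"
        self.docs[lucene_id] = {
            "TypeId": type_id,
            "ItemId": item_id,
            "Language": language,
            "Title": title,
        }
        for term in text.lower().split():
            self.postings[term].add(lucene_id)
        return lucene_id

    def search(self, term, language):
        # Match on the term first, then keep only hits in the requested language.
        hits = self.postings.get(term.lower(), set())
        return sorted(i for i in hits if self.docs[i]["Language"] == language)


index = ToyIndex()
index.add("page", 1, "en-GB", "Home", "Welcome to our chat room")
index.add("page", 1, "fr-FR", "Accueil", "Le chat dort")
print(index.search("chat", "fr-FR"))  # ['page-1-fr-FR']
```

The word "chat" occurs in both the English and the French page, and the language field is what keeps the French search from returning the English one.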

Tags: search localization lucene.net multilingual

2 Answers

【Aperson】

Reply #2 · 2019-05-07 03:50

I do [2], but one problem I have is that I cannot use a different analyzer for each language. I've combined the stopwords of the languages I want, but I lose the more advanced features a language-specific analyzer offers, such as stemming.
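A small sketch of the merged-stopword approach, in plain Python rather than Lucene.net. The stopword sets are tiny made-up samples, not the lists that ship with any analyzer, and analyze is a hypothetical helper. It shows one further cost of merging that may be worth knowing about: a stopword in one language can be a meaningful word in another.

```python
# Tiny illustrative samples (assumption: NOT the real analyzer stopword lists).
ENGLISH_STOPWORDS = {"the", "a", "of", "to"}
GERMAN_STOPWORDS = {"der", "die", "das", "und"}
MERGED_STOPWORDS = ENGLISH_STOPWORDS | GERMAN_STOPWORDS


def analyze(text, stopwords):
    """Lowercase, split on whitespace, drop stopwords. No stemming at all."""
    return [t for t in text.lower().split() if t not in stopwords]


english_text = "The batteries die quickly"
merged = analyze(english_text, MERGED_STOPWORDS)
english_only = analyze(english_text, ENGLISH_STOPWORDS)

print(merged)        # ['batteries', 'quickly']
print(english_only)  # ['batteries', 'die', 'quickly']
```

With the merged list, the English verb "die" disappears because it is also a German article. Neither variant stems "batteries" to "battery", which is the kind of language-specific processing this answer says is lost.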

SAY GOODBYE

Reply #3 · 2019-05-07 03:55

You can rule out both option 1 and option 2.
Use one index, and for each field that may contain Arabic words create two fields. For example, if the "Text" field might contain Arabic or English content:

Create 2 fields for "Text": one field, "Text", indexed and searched with your standard analyzer, and another one, "Text_AR", with the ArabicAnalyzer. To achieve that you can use PerFieldAnalyzerWrapper.
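To make the per-field idea concrete, here is a rough sketch in plain Python. It only mimics the concept of routing each field name to its own analyzer; it is not Lucene.net's PerFieldAnalyzerWrapper and does not show its signatures. The class PerFieldAnalyzer and both analyzer functions are hypothetical, and the "Arabic" analyzer is a deliberately crude stand-in that only strips the definite article prefix.

```python
ARABIC_ARTICLE = "\u0627\u0644"  # the definite article "al-" (alef + lam)


def standard_analyzer(text):
    """Hypothetical default analyzer: lowercase, split, drop English stopwords."""
    stopwords = {"the", "a", "of"}
    return [t for t in text.lower().split() if t not in stopwords]


def toy_arabic_analyzer(text):
    """Crude stand-in for Arabic stemming: strip a leading definite article."""
    tokens = []
    for token in text.split():
        if token.startswith(ARABIC_ARTICLE) and len(token) > 3:
            token = token[len(ARABIC_ARTICLE):]
        tokens.append(token)
    return tokens


class PerFieldAnalyzer:
    """Toy model of the per-field idea: pick an analyzer by field name."""

    def __init__(self, default_analyzer, field_analyzers=None):
        self.default_analyzer = default_analyzer
        self.field_analyzers = dict(field_analyzers or {})

    def analyze(self, field, text):
        analyzer = self.field_analyzers.get(field, self.default_analyzer)
        return analyzer(text)


analyzer = PerFieldAnalyzer(standard_analyzer, {"Text_AR": toy_arabic_analyzer})

# The same content is indexed twice: once as "Text", once as "Text_AR".
content = "the book \u0627\u0644\u0643\u062a\u0627\u0628"  # "the book" + Arabic "al-kitab"
text_tokens = analyzer.analyze("Text", content)
text_ar_tokens = analyzer.analyze("Text_AR", content)
```

Here "Text" keeps the Arabic word as written and drops the English stopword, while "Text_AR" strips the article so that a search for the bare noun can match. The same wrapper has to be used at search time, so that a query against "Text_AR" is analyzed the same way as the indexed text.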
