Solr: DIH for multilingual index & multiValued fie

Posted 2020-06-28 01:24

I have a MySQL table:

CREATE TABLE documents (
    id INT NOT NULL AUTO_INCREMENT,
    language_code CHAR(2),
    tags CHAR(30),
    text TEXT,
    PRIMARY KEY (id)
);

I have 2 questions about Solr DIH:

1) The language_code field indicates what language the text field is in. Depending on the language, I want to index text into a different Solr field.

# pseudo code

if language_code == "en":
    index "text" to Solr field "text_en"
elif language_code == "fr":
    index "text" to Solr field "text_fr"
elif language_code == "zh":
    index "text" to Solr field "text_zh"
...

Can DIH handle a use case like this? How do I configure it to do so?

2) The tags field needs to be indexed into a Solr multiValued field. Multiple values are stored in a single comma-separated string. For example, if tags contains the string "blue, green, yellow", then I want to index the 3 values "blue", "green", "yellow" into a Solr multiValued field.

How do I do that with DIH?

Thanks.

2 Answers
够拽才男人
Reply #2 · 2020-06-28 01:45

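First, your schema needs to allow it, with something like this (string is only a placeholder here; for full-text search each language would normally get its own analyzed field type):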

<dynamicField name="text_*" type="string" indexed="true" stored="true" />

Then, in your DIH config, add something like this:

<entity name="document" dataSource="ds1" transformer="script:ftextLang" query="SELECT * FROM documents" />

The script is defined just below the dataSource element:

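<script><![CDATA[
  // Copy "text" into a language-specific field such as text_en or text_fr.
  function ftextLang(row) {
    var name = row.get('language_code');
    var value = row.get('text');
    // Skip rows where either column is NULL, to avoid a field named "text_null".
    if (name != null && value != null) {
      row.put('text_' + name, value);
    }
    return row;
  }
]]></script>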
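
The second question (splitting tags into a multiValued field) can probably be handled with the same script approach. The following is only a sketch under assumptions: the names splitTags and fSplitTags are made up for illustration, the tags field is assumed to be declared multiValued="true" in the schema, and the script is assumed to run in DIH's Java-hosted JavaScript engine, where the java package global is available. DIH also ships a RegexTransformer with a splitBy attribute, which may cover this case without any script.

```javascript
// Pure helper: turn "blue, green, yellow" into ["blue", "green", "yellow"].
function splitTags(raw) {
  var out = [];
  if (raw == null) {
    return out; // NULL column: no tags
  }
  // String(...) coerces a Java string to a JavaScript string before splitting.
  var parts = String(raw).split(",");
  for (var i = 0; i < parts.length; i++) {
    var tag = parts[i].replace(/^\s+|\s+$/g, ""); // trim whitespace around each value
    if (tag.length > 0) {
      out.push(tag); // drop empty entries such as those produced by "a,,b"
    }
  }
  return out;
}

// Hypothetical DIH transformer function (the name is illustrative).
// It only works inside DIH, where `row` is a java.util.Map.
function fSplitTags(row) {
  var tags = splitTags(row.get('tags'));
  var list = new java.util.ArrayList();
  for (var i = 0; i < tags.length; i++) {
    list.add(tags[i]);
  }
  row.put('tags', list); // a Java List value is added to the document as multiple values
  return row;
}
```

Both functions would go inside the same script element, and the entity would then chain them, for example transformer="script:ftextLang,script:fSplitTags".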
放我归山
Reply #3 · 2020-06-28 02:00

Sorry, I don't have a direct answer to your DIH question, though I'd be interested to know.

I did notice your 2-letter language code and would suggest a 5-letter slot instead. Some languages have nontrivial dialect differences, for example Simplified Chinese vs. Traditional Chinese. For morphological analysis, the SmartCN filter can handle zh-cn but not zh-tw.

Portuguese and Spanish are also languages where we've been warned against mixing all dialects together, although the differences are less drastic, and both would still be searchable.

Of course, you may have already known this and just left it out of the question to keep it simple. It's just a subject that's very fresh in my mind.
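
If the column were widened to hold region-qualified codes such as zh-cn, the value would probably need normalizing before being used in a field name like text_zh_cn, since Solr recommends field names made of alphanumeric and underscore characters only. A rough sketch of such a helper follows; the name langFieldSuffix and the accepted code shape (two letters, optionally followed by a two-letter region) are assumptions for illustration, not something taken from the question.

```javascript
// Hypothetical helper: map a language code to a safe Solr field-name suffix.
// Returns null when the code does not look like "xx" or "xx-YY".
function langFieldSuffix(code) {
  if (code == null) {
    return null;
  }
  var s = String(code)
    .replace(/^\s+|\s+$/g, "") // trim padding (CHAR columns may be space-padded)
    .toLowerCase()
    .replace(/-/g, "_");       // hyphens are best avoided in field names
  return /^[a-z]{2}(_[a-z]{2})?$/.test(s) ? s : null;
}
```

Inside a transformer such as ftextLang, the result could then be used as 'text_' + suffix, skipping the row's text when the helper returns null.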
