Defining nested entities in Solr Data Import Handler

Posted 2020-02-12 13:22

Question:

Let me preface by mentioning that I've been through everything I could find about this topic including the Solr docs and all of the SO questions.

I have a Solr instance that I've set up with a Data Import Handler to pull in data from MSSQL using the JDBC driver. The data comes in, but it isn't structured the way I'd expect based on the Solr DIH documentation:

<document>
 <entity>
  <entity />
 </entity>
</document>

I've tried all the attributes like rootEntity, flatten, using CachedSqlProvider, etc. With multiValued="true", the result ends up as

docs [
{
  recordId: '1234',
  name: 'whatever',
  subrows_col1: ['x','y','z'],
  subrows_col2: ['a','b','c']
}
]

When I'm looking for

docs [
{
  recordId: '1234',
  name: 'whatever',
  subrows: [
    {
      col1: 'x',
      col2: 'a'
    },
    {
      col1: 'y',
      col2: 'b'
    },
    {
      col1: 'z',
      col2: 'c'
    }
  ]
}
]

I've seen the block-join stuff, but I'm confused as to where it goes. I added

<add>
 <doc>
  <field />
  <doc>
   <field />
  </doc>
 </doc>
</add>

to the DIH requestHandler, but it did nothing. I added it to the /update requestHandler and got an error. I have no clue where it is supposed to go. Does it only work during a query, or only when you push data to Solr via /update?

Where do I define the structure for the document? I tried nested fields in the schema, entities in the DIH config, and the block-join stuff in the requestHandlers. Nothing has worked yet.

Obviously I'm missing something.

Answer 1:

DIH does not produce nested documents. Solr supports them, but DIH can't yet generate them.

Nested entities in DIH exist to merge data from several sources and to create records by iterating over values that come from a different source. For example, an outer entity can read a list of file names while an inner entity loads the content of those files, with each file getting its own record.

You may want to move your nested object code into the client with SolrJ for now.
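As a rough illustration of what that client-side code might look like, the sketch below groups flat parent/child join rows into the nested shape from the question. The JoinRow class, the nest method, and the "subrows" key are hypothetical names chosen for this example. Plain maps stand in for Solr documents so the sketch runs without a Solr dependency; with SolrJ, each child would instead be attached to its parent via SolrInputDocument.addChildDocument before sending the parent to Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedDocSketch {

    /** Hypothetical stand-in for one row of a parent/child SQL join. */
    static final class JoinRow {
        final String recordId, name, col1, col2;

        JoinRow(String recordId, String name, String col1, String col2) {
            this.recordId = recordId;
            this.name = name;
            this.col1 = col1;
            this.col2 = col2;
        }
    }

    /**
     * Groups flat join rows into one map per parent record, collecting
     * the child columns of each row under a "subrows" list.
     */
    static List<Map<String, Object>> nest(List<JoinRow> rows) {
        Map<String, Map<String, Object>> byId = new LinkedHashMap<>();
        for (JoinRow row : rows) {
            // Create the parent the first time its id is seen.
            Map<String, Object> parent = byId.computeIfAbsent(row.recordId, id -> {
                Map<String, Object> p = new LinkedHashMap<>();
                p.put("recordId", id);
                p.put("name", row.name);
                p.put("subrows", new ArrayList<Map<String, Object>>());
                return p;
            });

            // Each join row contributes exactly one child.
            Map<String, Object> child = new LinkedHashMap<>();
            child.put("col1", row.col1);
            child.put("col2", row.col2);

            @SuppressWarnings("unchecked")
            List<Map<String, Object>> subrows =
                (List<Map<String, Object>>) parent.get("subrows");
            subrows.add(child);
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<JoinRow> rows = Arrays.asList(
            new JoinRow("1234", "whatever", "x", "a"),
            new JoinRow("1234", "whatever", "y", "b"),
            new JoinRow("1234", "whatever", "z", "c"));
        System.out.println(nest(rows));
    }
}
```

Grouping by parent id in a LinkedHashMap keeps the parents in the order the query returned them, which matters if the SQL already sorts the rows.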



Answer 2:

Indexing nested documents in DIH is finally supported from Solr 5.1 onwards.

https://issues.apache.org/jira/browse/SOLR-5147

Simply add child="true" to the child entity, and Solr DIH will automatically index its rows as child documents.

Example taken from the JIRA issue linked above:

<document>
  <entity name='PARENT' query='select * from PARENT'>
    <field column='id' />
    <field column='desc' />
    <field column='type_s' />
    <entity child='true' name='CHILD' query="select * from CHILD where parent_id='${PARENT.id}'">
      <field column='id' />
      <field column='desc' />
      <field column='type_s' />
    </entity>
  </entity>
</document>

I also decompiled DocBuilder.class from solr-dataimporthandler-5.3.0.jar and found this code snippet:

if (doc != null) {
    if (epw.getEntity().isChild())
    {
        childDoc = new DocWrapper();
        handleSpecialCommands(arow, childDoc);
        addFields(epw.getEntity(), childDoc, arow, vr);
        doc.addChildDocument(childDoc);
    }
    else
    {
        handleSpecialCommands(arow, doc);
        addFields(epw.getEntity(), doc, arow, vr);
    }
}

Note that epw.getEntity().isChild() returns true when child="true" is set, so DIH creates a new DocWrapper and adds it as a child document instead of simply adding the entity's columns as a bunch of new fields on the parent.
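To make the difference between the two branches concrete, here is a toy model of that logic. It is only a sketch: the Doc class, addField, and applyRow are hypothetical stand-ins rather than the real DocWrapper or DocBuilder API, and it assumes that fields accumulate values when the same column is added more than once, which matches the multi-valued result shown in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChildEntitySketch {

    /** Toy document: multi-valued fields plus a list of child documents. */
    static final class Doc {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();
        final List<Doc> children = new ArrayList<>();

        /** Appends a value, so repeated columns become multi-valued. */
        void addField(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /**
     * Mirrors the shape of the decompiled branch: a row from a child
     * entity becomes its own document attached to the parent, while any
     * other row is merged into the parent's fields.
     */
    static void applyRow(Doc doc, Map<String, Object> row, boolean isChild) {
        if (doc == null) {
            return;
        }
        if (isChild) {
            Doc childDoc = new Doc();
            row.forEach(childDoc::addField);
            doc.children.add(childDoc);
        } else {
            row.forEach(doc::addField);
        }
    }

    public static void main(String[] args) {
        String[][] rows = {{"x", "a"}, {"y", "b"}, {"z", "c"}};
        Doc flat = new Doc();
        Doc nested = new Doc();
        for (String[] r : rows) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("col1", r[0]);
            row.put("col2", r[1]);
            applyRow(flat, row, false);   // without child="true"
            applyRow(nested, row, true);  // with child="true"
        }
        System.out.println("flat fields: " + flat.fields);
        System.out.println("nested children: " + nested.children.size());
    }
}
```

In this model the non-child path reproduces the parallel multi-valued columns the question complains about, while the child path yields one child document per row.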



Tags: solr