This question is similar to Solr doesn't overwrite - duplicated uniqueKey entries, but I am in a situation where I have a large body of existing documents that have already been added to the collection with no child documents, and I am using (standalone not cloud) Solr 6.4 rather than 5.3.1. We recently enabled child documents so that we could store richer data.
We use SolrJ to load data into and query Solr, but to isolate the issue we're seeing, I used the command line Solr post
tool to upload the following document:
<add>
<doc>
<field name="id">1</field>
<field name="solr_record_type">1</field>
<field name="title">Fabulous Book</field>
<field name="author">Angelo Author</field>
</doc>
</add>
Search results were as expected:
Using q=id:1
and
fl=id,title,index_date,[child parentFilter="solr_record_type:1"]
"response":{"numFound":1,"start":0,"docs":[
{
"id":"1",
"title":"Fabulous Book",
"index_date":"2019-01-16T23:06:57.221Z"}]
}
Then I updated the document by posting the following:
<add>
<doc>
<field name="id">1</field>
<field name="solr_record_type">1</field>
<field name="title">Fabulous Book</field>
<field name="author">Angelo Author</field>
<doc>
<field name="id">1-1</field>
<field name="solr_record_type">2</field>
<field name="contributor_name">Polly Math</field>
<field name="contributor_type">3</field>
</doc>
</doc>
</add>
Then, repeating my search, I got the following duplicate result, searching on the unique id field, which is undesirable.
"response":{"numFound":2,"start":0,"docs":[
{
"id":"1",
"title":"Fabulous Book",
"index_date":"2019-01-16T23:06:57.221Z",
"_childDocuments_":[
{
"id":"1-1",
"solr_record_type":2,
"contributor_name":"Polly Math",
"contributor_type":3,
"index_date":"2019-01-16T23:09:29.142Z"}]},
{
"id":"1",
"title":"Fabulous Book",
"index_date":"2019-01-16T23:09:29.142Z",
"_childDocuments_":[
{
"id":"1-1",
"solr_record_type":2,
"contributor_name":"Polly Math",
"contributor_type":3,
"index_date":"2019-01-16T23:09:29.142Z"}]}]
}
Going the other way, if I start with a document that was loaded initially with a child document, like the following:
<add>
<doc>
<field name="id">2</field>
<field name="solr_record_type">1</field>
<field name="title">Wonderful Book</field>
<field name="author">Andy Author</field>
<doc>
<field name="id">2-1</field>
<field name="solr_record_type">2</field>
<field name="contributor_name">Polly Math</field>
<field name="contributor_type">3</field>
</doc>
</doc>
</add>
And then I update it with a document with no children:
<add>
<doc>
<field name="id">2</field>
<field name="solr_record_type">1</field>
<field name="title">Wonderful Book</field>
<field name="author">Andy Author</field>
</doc>
</add>
The result still has the child:
"response":{"numFound":1,"start":0,"docs":[
{
"id":"2",
"title":"Wonderful Book",
"index_date":"2019-01-16T23:09:39.389Z",
"_childDocuments_":[
{
"id":"2-1",
"title_id":2,
"title_instance_id":2,
"solr_record_type":2,
"contributor_name":"Polly Math",
"contributor_type":3,
"index_date":"2019-01-16T23:07:04.861Z"}]}]
}
This is strange because if I update a document with 2 child documents with a replacement document with only 1 child document, it does drop one child document. But in this case, it is not dropping the child document.
Updates of documents with no child documents that don't add child documents, and updates of documents with child documents that don't remove all child documents both seem to work as I'd expect.
I have a large body of existing documents that don't have children, which I may be adding children to, and eventually I may have a lot of child-having documents that might drop their children. Given that, what is the best way to update these records without generating duplicate records or losing updates?