Solr doesn't overwrite - duplicated uniqueKey

Posted 2020-07-09 08:14

Question:

I have a problem with Solr 5.3.1. My schema is rather simple: I have one uniqueKey, "id", a string field that is indexed, stored, required and not multivalued.

I first add documents with "content_type:document_unfinished" and later overwrite the same document, using the same id but content_type:document. The document then appears twice in the index. Again, the only uniqueKey is "id", a string. The id originally comes from an integer primary key in MySQL.

It also looks like I am not the only one this has happened to:

http://lucene.472066.n3.nabble.com/uniqueKey-not-enforced-td4015086.html

http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-td4129651.html

In my case not all the documents in the index are duplicated, just some. Initially I assumed that a document gets overwritten on commit when the same uniqueKey already exists in the index, but it doesn't seem to work the way I expected. I do not want to simply update some fields in the document; I want to replace it completely, with all its children.
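To find out which ids are affected when only some documents are duplicated, one option is to export the id of every document and look for repeats on the client side. Below is a minimal sketch of that check in plain Java. `DuplicateIds` is a hypothetical helper, not part of SolrJ, and the ids in `main` are made-up sample data. On the Solr side, faceting on the id field with `facet.mincount=2` should give a similar list, although I have not tried that on an index of this size.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of SolrJ.
final class DuplicateIds {

    // Returns every id that occurs more than once, sorted for stable output.
    static Set<String> find(List<String> ids) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String id : ids) {
            // Set.add() returns false when the id was already seen.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Sample ids, made up for illustration.
        System.out.println(find(Arrays.asList("1", "1-1", "1", "2"))); // prints [1]
    }
}
```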

Some stats: around 350k documents in the index, mostly with childDocuments. The documents are distinguished by a "content_type" field. I used SolrJ to import them like this:

    // docs is a Collection<SolrInputDocument>, each parent with its children attached
    HttpSolrServer server = new HttpSolrServer(url);
    server.add(docs);
    server.commit();

I am always adding a whole document with all the children again. It's nothing overly fancy. I end up with duplicated documents for the same uniqueKey. There are no side injections. I run only Solr with the integrated Jetty. I do not open the Lucene index in Java "manually".

What I did then was to delete and re-insert. That seemed to work for a while, but then, under some conditions, it started to give this error message:

Parent query yields document which is not matched by parents filter

The document where that happens seems to be completely random; only one thing seems to emerge: it happens on a childDocument. I do not run anything special; I basically downloaded the Solr package from the website and run it with bin/solr start
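One plausible explanation for the error after delete and re-insert is that deleting by id removes only the parent and leaves its child documents behind as orphans. A possible way around that is to delete the whole block with one query that matches both the parent's id and the `_root_` field of its children. The sketch below only builds that query string in plain Java. `BlockDeleteQuery` is a hypothetical helper, not part of SolrJ, and it assumes the uniqueKey field is "id" and that the schema defines `_root_`. With SolrJ the resulting string would be passed to `solrClient.deleteByQuery(...)` before re-adding the document together with its children.

```java
// Hypothetical helper, not part of SolrJ: builds a delete query that matches
// a document by its uniqueKey and, if it was indexed as a block, its children.
final class BlockDeleteQuery {

    // Escapes backslashes and double quotes so the value can sit inside a quoted phrase.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Assumes the uniqueKey field is "id" and the schema defines "_root_".
    static String forParent(String parentId) {
        String quoted = quote(parentId);
        return "id:" + quoted + " OR _root_:" + quoted;
    }

    public static void main(String[] args) {
        // This string is what would be handed to solrClient.deleteByQuery(...).
        System.out.println(forParent("1")); // prints id:"1" OR _root_:"1"
    }
}
```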

Anyone any ideas?

EDIT 1

I think I found the problem, which seems to be a bug? To reproduce the issue:

I downloaded Solr 5.3.1 onto a Debian VM in VirtualBox and started it with bin/solr start. I added a new core with the basic config set. I changed nothing in the basic config set; I just copied it over and added the core.

This leads to two documents with the same id in the index:

    SolrClient solrClient = new HttpSolrClient("http://192.168.56.102:8983/solr/test1");
    SolrInputDocument inputDocument = new SolrInputDocument();
    inputDocument.setField("id", "1");
    inputDocument.setField("content_type_s", "doc_unfinished");
    solrClient.add(inputDocument);
    solrClient.commit();
    solrClient.close();

    solrClient = new HttpSolrClient("http://192.168.56.102:8983/solr/test1");
    inputDocument = new SolrInputDocument();
    inputDocument.setField("id", "1");
    inputDocument.setField("content_type_s", "doc");
    SolrInputDocument childDocument = new SolrInputDocument();
    childDocument.setField("id","1-1");
    childDocument.setField("content_type_s", "subdoc");
    inputDocument.addChildDocument(childDocument);
    solrClient.add(inputDocument);
    solrClient.commit();
    solrClient.close();

Searching with:

http://192.168.56.102:8983/solr/test1/select?q=*%3A*&wt=json&indent=true

leads to the following output:

{

  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "*:*",
      "indent": "true",
      "wt": "json",
      "_": "1450078098465"
    }
  },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "content_type_s": "doc_unfinished",
        "_version_": 1520517084715417600
      },
      {
        "id": "1-1",
        "content_type_s": "subdoc"
      },
      {
        "id": "1",
        "content_type_s": "doc",
        "_version_": 1520517084838101000
      }
    ]
  }
}

What am I doing wrong?

Answer 1:

Thanks for your feedback! I am writing this as an answer since it would be too long for a comment. I actually got the same response from the mailing list:

Mikhail Khludnev:

Hello Sebastian,

Mixing standalone docs and blocks doesn't work. There are plenty of open issues.

On Wed, Mar 9, 2016 at 3:02 PM, Sebastian Riemer wrote:

Hi,

to actually describe my problem in short, instead of just linking to the test application, using SolrJ I do the following:

1) Create a new document as a parent and commit

    SolrInputDocument parentDoc = new SolrInputDocument();
    parentDoc.addField("id", "parent_1");
    parentDoc.addField("name_s", "Sarah Connor");
    parentDoc.addField("blockJoinId", "1");
    solrClient.add(parentDoc);
    solrClient.commit();

2) Create a new document with the same unique-id as in 1) with a child document appended

    SolrInputDocument parentDocUpdateing = new SolrInputDocument();
    parentDocUpdateing.addField("id", "parent_1");
    parentDocUpdateing.addField("name_s", "Sarah Connor");
    parentDocUpdateing.addField("blockJoinId", "1");

    SolrInputDocument childDoc = new SolrInputDocument();
    childDoc.addField("id", "child_1");
    childDoc.addField("name_s", "John Connor");
    childDoc.addField("blockJoinId", "1");

    parentDocUpdateing.addChildDocument(childDoc);
    solrClient.add(parentDocUpdateing);
    solrClient.commit();

3) This results in 2 documents with id="parent_1" in the Solr index

Is this normal behaviour? I thought the existing document should be updated instead of generating a new document with same id.

For a full working test application please see the original message.

Best regards, Sebastian

I think it is a known issue, and several tickets relate to it, but I am glad that there is a way to deal with it: adding child docs right from the beginning. Related tickets: https://issues.apache.org/jira/browse/SOLR-6096, https://issues.apache.org/jira/browse/SOLR-5211, https://issues.apache.org/jira/browse/SOLR-7606
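To illustrate why adding child docs from the beginning helps, here is a toy in-memory model in plain Java. It is NOT Solr code. It rests on an assumption taken from my reading of the tickets: a standalone add replaces existing documents by the "id" term, while a block add (parent with children) replaces them by the `_root_` term, which a standalone document does not carry. `ToyIndex` and its methods are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model, NOT Solr code. Assumption: a standalone add overwrites
// by "id", a block add overwrites by "_root_".
final class ToyIndex {

    // Each entry is {id, root}; root is null for documents added without children.
    final List<String[]> docs = new ArrayList<>();

    void addStandalone(String id) {
        docs.removeIf(d -> d[0].equals(id));        // overwrite by id
        docs.add(new String[] {id, null});
    }

    void addBlock(String parentId, String... childIds) {
        docs.removeIf(d -> parentId.equals(d[1]));  // overwrite by _root_
        for (String childId : childIds) {
            docs.add(new String[] {childId, parentId});
        }
        docs.add(new String[] {parentId, parentId});
    }

    long count(String id) {
        return docs.stream().filter(d -> d[0].equals(id)).count();
    }

    public static void main(String[] args) {
        // Standalone first, then a block with the same id: the old doc survives.
        ToyIndex mixed = new ToyIndex();
        mixed.addStandalone("1");
        mixed.addBlock("1", "1-1");
        System.out.println(mixed.count("1"));       // prints 2

        // Blocks from the beginning: the second add replaces the first.
        ToyIndex blocksOnly = new ToyIndex();
        blocksOnly.addBlock("1", "1-0");
        blocksOnly.addBlock("1", "1-1");
        System.out.println(blocksOnly.count("1"));  // prints 1
    }
}
```

In this model the mixed case ends with three documents, two of them with id "1", which matches the numFound of 3 in the question. Whether Solr behaves exactly like this internally is an assumption on my part.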