Duplicate documents on _id (in mongo)

2019-02-24 18:07发布

I have a sharded mongo collection, with over 1.5 mil documents. I use the _id column as a shard key, and the values in this column are integers (rather than ObjectIds).

I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.

My problem is that somehow, I have duplicate documents on the same _id. From what I've read, this shouldn't be possible.

I've removed the duplicates, but others still appear.

Do you have any ideas where could they come from, or what should I start looking at? (Also, I've tried to replicate this on a smaller, test collection, but no duplicates are inserted, no matter what write operation I perform).

2条回答
狗以群分
2楼-- · 2019-02-24 18:22

This actually isn't a problem with the Perl driver .. it is related to the characteristics of sharding. MongoDB is only able to enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.

In the MongoDB: Configuring Sharding documentation there is specific mention that:

  • When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.

  • You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key.

  • If the "unique: true" option is not used, the shard key does not have to be unique.

查看更多
ら.Afraid
3楼-- · 2019-02-24 18:29

How have you implemented generating the integer Ids?

If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:

function counter(name) {
    var ret = db.counters.findAndModify({
         query:{_id:name}, 
         update:{$inc:{next:1}}, 
         "new":true, 
         upsert:true});

    return ret.next;
}

db.users.insert({_id:counter("users"), name:"Sarah C."}) // _id : 1
db.users.insert({_id:counter("users"), name:"Bob D."}) // _id : 2

If you are generating your Ids by reading a most recent record in the document store, then incrementing the number in the perl code, then inserting with the incremented number you could be running into timing issues.

查看更多
登录 后发表回答