I have a sharded mongo collection, with over 1.5 mil documents. I use the _id column as a shard key, and the values in this column are integers (rather than ObjectIds).
I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.
My problem is that somehow, I have duplicate documents on the same _id. From what I've read, this shouldn't be possible.
I've removed the duplicates, but others still appear.
Do you have any ideas where could they come from, or what should I start looking at? (Also, I've tried to replicate this on a smaller, test collection, but no duplicates are inserted, no matter what write operation I perform).
This actually isn't a problem with the Perl driver .. it is related to the characteristics of sharding. MongoDB is only able to enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.
In the MongoDB: Configuring Sharding documentation there is specific mention that:
When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.
You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key.
If the "unique: true" option is not used, the shard key does not have to be unique.
How have you implemented generating the integer Ids?
If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:
If you are generating your Ids by reading a most recent record in the document store, then incrementing the number in the perl code, then inserting with the incremented number you could be running into timing issues.