I have a collection in MongoDB with around 3 million records. A sample record looks like this:
{
  "_id" : ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
  "source_references" : [
    {
      "_id" : ObjectId("5045xxxxxxxxxxxxxx"),
      "name" : "xxx",
      "key" : 123
    }
  ]
}
I have a lot of duplicate records in the collection that share the same source_references.key. (By duplicate I mean the same source_references.key, not the same _id.)
I want to remove duplicate records based on source_references.key. I'm thinking of writing some PHP code to traverse each record and remove it if a duplicate already exists.
Is there a way to remove the duplicates from the MongoDB command line instead?
This is the easiest query I used on MongoDB 3.2.
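The query itself is missing from this copy of the answer. A minimal sketch of one approach that fits the description: walk the collection in ascending _id order and, for each document, remove every later document with the same key. It assumes the legacy mongo shell (where collections have remove()) and a top-level field such as customKey; the helper name removeLaterDuplicates and the collection name are hypothetical.

```javascript
// Sketch for the legacy mongo shell. "coll" is a collection such as
// db.myCollection (placeholder name); keyField is a top-level field name.
function removeLaterDuplicates(coll, keyField) {
  var projection = {};
  projection[keyField] = 1; // fetch only _id and the key

  // Visit documents in ascending _id order.
  coll.find({}, projection).sort({ _id: 1 }).forEach(function (doc) {
    if (doc[keyField] === undefined) {
      return; // skip documents that do not have the key at all
    }
    // Remove every document with the same key and a greater _id,
    // so the document with the smallest _id per key survives.
    var query = { _id: { $gt: doc._id } };
    query[keyField] = doc[keyField];
    coll.remove(query);
  });
}
```

In the shell this would be called as removeLaterDuplicates(db.myCollection, "customKey"). It issues one remove per document, which is why an index on the key matters.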
Index your customKey before running this to increase speed.
pip install mongo_remove_duplicate_indexes
Check out the source code of the mongo_remove_duplicate_indexes package for a better understanding.
If you have enough memory, you can do something like this in Scala:
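The Scala snippet is missing from this copy. The same in-memory idea can be sketched in JavaScript (usable from the mongo shell) instead: load the _id and key of every document, track the keys already seen, and collect the _id of every document whose key appeared earlier. The helper name duplicateIds, the customKey field, and the collection name are assumptions for illustration, not the original author's code.

```javascript
// Given an array of documents already loaded into memory, return the _ids of
// every document whose key value was seen earlier in the array.
function duplicateIds(docs, keyField) {
  var seen = {};       // keys encountered so far
  var duplicates = []; // _ids of documents to delete

  docs.forEach(function (doc) {
    // Stringify so that the number 1 and the string "1" stay distinct.
    var key = JSON.stringify(doc[keyField]);
    if (key === undefined) {
      return; // key missing: leave the document alone
    }
    if (Object.prototype.hasOwnProperty.call(seen, key)) {
      duplicates.push(doc._id);
    } else {
      seen[key] = true;
    }
  });
  return duplicates;
}

// Possible use in the legacy mongo shell (placeholder collection name):
//   var docs = db.myCollection.find({}, { customKey: 1 }).toArray();
//   db.myCollection.remove({ _id: { $in: duplicateIds(docs, "customKey") } });
```

This only works if the projected documents fit in memory, which is the condition the answer states.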
If you are certain that source_references.key identifies duplicate records, you can ensure a unique index with the dropDups: true index creation option in MongoDB 2.6 or older. This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Notes:
- The dropDups option was removed in MongoDB 3.0, so a different approach will be required. For example, you could use aggregation as suggested on: MongoDB duplicate documents even after adding unique key.
- Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse: true index creation option so the index only applies to documents with a source_references.key field.
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
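The index creation described above can be sketched as follows for the legacy mongo shell on MongoDB 2.6 or older. The wrapper name dedupeWithUniqueIndex and the collection name things are placeholders; the underlying call is ensureIndex with the unique, dropDups, and (optionally) sparse options named in the answer.

```javascript
// MongoDB 2.6 or older only: the dropDups option was removed in 3.0.
// "coll" is a collection such as db.things (placeholder name).
function dedupeWithUniqueIndex(coll, sparse) {
  var options = { unique: true, dropDups: true };
  if (sparse) {
    // Index only documents that have source_references.key, so documents
    // without the field are not treated as duplicates of each other.
    options.sparse = true;
  }
  // Building the index deletes documents that would violate uniqueness.
  return coll.ensureIndex({ "source_references.key": 1 }, options);
}
```

For example, dedupeWithUniqueIndex(db.things, true) would build the sparse variant. The operation is destructive, so the backup advice above applies.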
Remove duplicates with the aggregation framework.
a. If you want to delete in one go.
b. You can delete documents one by one.
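The code for both variants is missing from this copy. A minimal sketch of how they could look in the legacy mongo shell, assuming MongoDB 2.6 or later (aggregate returning a cursor, allowDiskUse available); the function names and the collection name are hypothetical. Both variants group documents by source_references.key, keep one _id per group, and remove the rest.

```javascript
// "coll" is a collection such as db.collectionName (placeholder name).
// Returns the groups of documents that share the same key values.
function duplicateGroups(coll) {
  return coll.aggregate([
    // source_references is an array, so the group key is the array of
    // its key values, e.g. [123] for the sample record.
    { $group: {
        _id: "$source_references.key",
        dups: { $addToSet: "$_id" },
        count: { $sum: 1 }
    } },
    { $match: { count: { $gt: 1 } } }
  ], { allowDiskUse: true });
}

// a. Collect all duplicate _ids, then delete them in one go.
function removeDuplicatesAtOnce(coll) {
  var duplicates = [];
  duplicateGroups(coll).forEach(function (group) {
    // Keep one _id per group ($addToSet order is unspecified, so which
    // one is kept is arbitrary) and mark the rest for deletion.
    duplicates = duplicates.concat(group.dups.slice(1));
  });
  coll.remove({ _id: { $in: duplicates } });
  return duplicates.length;
}

// b. Delete the duplicates one group at a time.
function removeDuplicatesOneByOne(coll) {
  duplicateGroups(coll).forEach(function (group) {
    coll.remove({ _id: { $in: group.dups.slice(1) } });
  });
}
```

With a very large number of duplicates, the single $in list in variant a may run into the BSON document size limit, in which case variant b is likely the safer choice.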
While @Stennie's answer is valid, it is not the only way. In fact, the MongoDB manual asks you to be very cautious when doing that. There are two other options