Suppose I have a collection with some set of documents. something like this.
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}
I want to find all the duplicated entries in this collection by the "name" field. E.g. "foo" appears twice and "bar" appears 3 times.
The highest accepted answer here has this:
That would also return to you a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:
To explain this, if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) which work with the GROUP BY statement allowing you, for example, to find the total count that a column value appears in a table.
As you can see, our output shows the count that each my_type value appears. To find duplicates in MongoDB, we would tackle the problem in a similar way. MongoDB boasts aggregation operations, which group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. It's a similar concept to aggregate functions in SQL.
Assuming a collection called contacts, the initial setup looks as follows:
This aggregate function takes an array of aggregation operators, and in our case, we desire the $group operator, since our goal is to group the data by the field's count, that is, the number of occurances of the field value.
There's a little idiosyncracy to this approach. The _id field is required to use the group by operator. In this case, we are grouping the $name field. The key name within _id can have any name. But we use name since it is intuitive here.
By running the aggregation using only the $group operator, we will get a list of all the name fields (regardless if they appear once or more than once in the collection):
Notice above how aggregation works. It took documents with name fields and returns a new collection of the name fields extracted.
But what we want to know is how many times does the field value reappear. The $group operator takes a count field which uses the $sum operator to add the expression 1 to the total for each document in the group. So the $group and $sum together returns the collective sum of all the numeric values that result for a given field (e.g. name).
Since the goal was to eliminate duplicates, it requires one extra step. To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1.
As a side note, if you are using MongoDB through a ORM like Mongoid for Ruby, you might get this error:
This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. Consequently, either update your ORM or find a fix. For Mongoid, this was the fix for me:
The accepted answer is terribly slow on large collections, and doesn't return the
_id
s of the duplicate records.Aggregation is much faster and can return the
_id
s:In the first stage of the aggregation pipeline, the $group operator aggregates documents by the
name
field and stores inuniqueIds
each_id
value of the grouped records. The $sum operator adds up the values of the fields passed to it, in this case the constant1
- thereby counting the number of grouped records into thecount
field.In the second stage of the pipeline, we use $match to filter documents with a
count
of at least 2, i.e. duplicates.Then, we sort the most frequent duplicates first, and limit the results to the top 10.
This query will output up to
$limit
records with duplicate names, along with their_id
s. For example:For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using
group
. Note that aggregation is faster and more powerful in that it can return the_id
s of the duplicate records.For pymongo, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method:
Output will be this :
The equivalent SQL query would be:
SELECT name, COUNT(name) FROM prb GROUP BY name
. Note that we still need to filter out elements with a count of 0 from the array. Again, refer to the MongoDB cookbook recipe for finding duplicates usinggroup
for the canonical solution usinggroup
.Note: this solution is the easiest to understand, but not the best.
You can use
mapReduce
to find out how many times a document contains a certain field:aggregation pipeline framework can be used to easily identify documents with duplicate key values:
~ Ref: useful information on an official mongo lab blog:
https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework