Find all duplicate documents in a MongoDB collecti

2020-01-26 03:40发布

站内文章 / MongoDB

57 0

贪生不怕死

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Suppose I have a collection with some set of documents. something like this.

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}

I want to find all the duplicated entries in this collection by the "name" field. E.g. "foo" appears twice and "bar" appears 3 times.

回答1:

Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to find out how many times a document contains a certain field:

var map = function(){
   if(this.name) {
        emit(this.name, 1);
   }
}

var reduce = function(key, values){
    return Array.sum(values);
}

var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
db[res.result].find({value: {$gt: 1}}).sort({value: -1});

回答2:

The accepted answer is terribly slow on large collections, and doesn't return the _ids of the duplicate records.

Aggregation is much faster and can return the _ids:

db.collection.aggregate([
  { $group: {
    _id: { name: "$name" },   // replace `name` here twice
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
]);

In the first stage of the aggregation pipeline, the $group operator aggregates documents by the name field and stores in uniqueIds each _id value of the grouped records. The $sum operator adds up the values of the fields passed to it, in this case the constant 1 - thereby counting the number of grouped records into the count field.

In the second stage of the pipeline, we use $match to filter documents with a count of at least 2, i.e. duplicates.

Then, we sort the most frequent duplicates first, and limit the results to the top 10.

This query will output up to $limit records with duplicate names, along with their _ids. For example:

{
  "_id" : {
    "name" : "Toothpick"
},
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv",
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}

回答3:

For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful in that it can return the _ids of the duplicate records.

For pymongo, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method:

$connection = 'mongodb://localhost:27017';
$con        = new Mongo($connection); // mongo db connection

$db         = $con->test; // database 
$collection = $db->prb; // table

$keys       = array("name" => 1); Select name field, group by it

// set intial values
$initial    = array("count" => 0);

// JavaScript function to perform
$reduce     = "function (obj, prev) { prev.count++; }";

$g          = $collection->group($keys, $initial, $reduce);

echo "<pre>";
print_r($g);

Output will be this :

Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] => 
                    [count] => 1
                )

            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )

        )

    [count] => 3
    [keys] => 2
    [ok] => 1
)

The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that we still need to filter out elements with a count of 0 from the array. Again, refer to the MongoDB cookbook recipe for finding duplicates using group for the canonical solution using group.

回答4:

aggregation pipeline framework can be used to easily identify documents with duplicate key values:

// Desired unique index: 
// db.collection.ensureIndex({ firstField: 1, secondField: 1 }, { unique: true})

db.collection.aggregate([
  { $group: { 
    _id: { firstField: "$firstField", secondField: "$secondField" }, 
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  }}, 
  { $match: { 
    count: { $gt: 1 } 
  }}
])

~ Ref: useful information on an official mongo lab blog:

https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework

回答5:

The highest accepted answer here has this:

uniqueIds: { $addToSet: "$_id" },

That would also return to you a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:

db.collection.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

To explain this, if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) which work with the GROUP BY statement allowing you, for example, to find the total count that a column value appears in a table.

SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type         |
+----------+-----------------+
|        3 | Contact         |
|        1 | Practice        |
|        1 | Prospect        |
|        1 | Task            |
+----------+-----------------+

As you can see, our output shows the count that each my_type value appears. To find duplicates in MongoDB, we would tackle the problem in a similar way. MongoDB boasts aggregation operations, which group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. It's a similar concept to aggregate functions in SQL.

Assuming a collection called contacts, the initial setup looks as follows:

db.contacts.aggregate([ ... ]);

This aggregate function takes an array of aggregation operators, and in our case, we desire the $group operator, since our goal is to group the data by the field's count, that is, the number of occurances of the field value.

db.contacts.aggregate([  
    {$group: { 
        _id: {name: "$name"} 
        } 
    }
]);

There's a little idiosyncracy to this approach. The _id field is required to use the group by operator. In this case, we are grouping the $name field. The key name within _id can have any name. But we use name since it is intuitive here.

By running the aggregation using only the $group operator, we will get a list of all the name fields (regardless if they appear once or more than once in the collection):

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"} 
    } 
  }
]);

{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }

Notice above how aggregation works. It took documents with name fields and returns a new collection of the name fields extracted.

But what we want to know is how many times does the field value reappear. The $group operator takes a count field which uses the $sum operator to add the expression 1 to the total for each document in the group. So the $group and $sum together returns the collective sum of all the numeric values that result for a given field (e.g. name).

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"},
    count: {$sum: 1}
    } 
  }
]);

{ "_id" : { "name" : "John" },  "count" : 1  }
{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
{ "_id" : { "name" : "Amanda" },  "count" : 1 }

Since the goal was to eliminate duplicates, it requires one extra step. To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1.

db.contacts.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }

As a side note, if you are using MongoDB through a ORM like Mongoid for Ruby, you might get this error:

The 'cursor' option is required, except for aggregate with the explain argument

This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. Consequently, either update your ORM or find a fix. For Mongoid, this was the fix for me:

module Moped
  class Collection
    # Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries.  This overrides
    # `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
    #
    # Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
    #
    #   Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
    #   the command includes the `explain` option. Unless you include the `explain` option, you must specify the
    #   `cursor` option.
    #
    #   To indicate a cursor with the default batch size, specify `cursor: {}`.
    #
    #   To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
    #
    def aggregate(*pipeline)
      # Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
      extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
    end

    private

    def extract_result(response)
      response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
    end
  end
end