pymongo: remove duplicates (map reduce?)

2019-04-06 09:54发布

问题:

I do have a Database with several collections (overall ~15mil documents) and documents look like this (simplified):

{'Text': 'blabla', 'ID': 101}
{'Text': 'Whuppppyyy', 'ID': 102}
{'Text': 'Abrakadabraaa', 'ID': 103}
{'Text': 'olalalaal', 'ID': 104}
{'Text': 'test1234545', 'ID': 104}
{'Text': 'whapwhapwhap', 'ID': 104}

They all have an unique _id field as well, but I want to delete duplicates accodring to another field (the external ID field).

First, I tried a very manual approach with lists and deleting afterwards, but the DB seems too big, takes very long and is not practical.

Second, the following does not work in current MongoDB versions anymore, even though anyone is suggesting it.

db.collection.ensureIndex( { ID: 1 }, { unique: true, dropDups: true } )

So, now I'm trying to create a map reduce solution, but I dont really know what Im doing and especially have difficulty using another field (not the database _id) to find and delete duplicates. Here is my bad first approach (adopted from some interent source):

map = Code("function(){ if(this.fieldName){emit(this.fieldName,1);}}")
reduce = Code("function(key,values) {return Array.sum(values);}")
res = coll.map_reduce(map,reduce,"my_results");

response = []
for doc in res.find():
    if(doc['value'] > 1):
        count = int(doc['value']) - 1
        docs = col.find({"fieldName":doc['ID']},{'ID':1}).limit(count)
        for i in docs:
            response.append(i['ID'])

coll.remove({"ID": {"$in": response}})

Any help to reduce any duplicates in the external ID field (leaving one entry), would be very much apprechiated ;) Thanks!

回答1:

An alternative approach is to use the aggregation framework which has better performance than map-reduce. Consider the following aggregation pipeline which as the first stage of the aggregation pipeline, the $group operator groups documents by the ID field and stores in the unique_ids field each _id value of the grouped records using the $addToSet operator. The $sum accumulator operator adds up the values of the fields passed to it, in this case the constant 1 - thereby counting the number of grouped records into the count field. The other pipeline step $match filters documents with a count of at least 2, i.e. duplicates.

Once you get the result from the aggregation, you iterate the cursor to remove the first _id in the unique_ids field, then push the rest into an array that will be used later to remove the duplicates (minus one entry):

cursor = db.coll.aggregate(
    [
        {"$group": {"_id": "$ID", "unique_ids": {"$addToSet": "$_id"}, "count": {"$sum": 1}}},
        {"$match": {"count": { "$gte": 2 }}}
    ]
)

response = []
for doc in cursor:
    del doc["unique_ids"][0]
    for id in doc["unique_ids"]:
        response.append(id)

coll.remove({"_id": {"$in": response}})


回答2:

First, I tried a very manual approach with lists and deleting afterwards, but the DB seems too big, takes very long and is not practical.

The best bet is using the .aggregate() method which provides access to the aggregation pipeline to find those documents that are duplicate. The first stage in the pipeline is the $group stage where you group your documents by the duplicated key then use the $push and $sum accumulator operators which respectively return an array of all _id for each group and the count of elements in the group. The next and last stage in the pipeline is the $match stage to return only those result where there is duplicate "ID". From there you then iterate the cursor and update each document using "bulk" operations.

pipeline = [{'$group': {'_id': '$ID', 'count': {'$sum': 1}, 'ids': {'$push': '$_id'}}},
    {'$match': {'count': {'$gte': 2}}}]

bulk = db.collection.initialize_ordered_bulk_op()
count = 0
for document in db.collection.aggregate(pipeline):
    it = iter(document['ids'])
    next(it)
    for id in it:
        bulk.find({'_id': id}).remove_one({'_id': id})
        count = count + 1
        if count % 1000 == 0:
            bulk.execute()
    if count > 0:
        bulk.execute()

MongoDB 3.2 deprecates Bulk() and its associated methods so you will need to use the bulk_write() method to execute your request.

from pymongo import DeleteOne

request = []
for document in db.collection.aggregate(pipeline):
    it = iter(document['ids'])
    next(it)
    for id in it:
        requests.append(DeleteOne({'_id': id}))
db.collection.bulk_write(requests)

You can also do this in the shell as shown in the accepted answers to remove dups from mongodb and How to remove duplicates with a certain condition in mongodb?