How can I use a cursor.forEach() in MongoDB using Node.js?

Published 2019-01-06 13:05

Question:

I have a huge collection of documents in my DB and I'm wondering how I can run through all the documents and update them, each document with a different value.

Answer 1:

The answer depends on the driver you're using. All MongoDB drivers I know have cursor.forEach() implemented one way or another.

Here are some examples:

node-mongodb-native

collection.find(query).forEach(function(doc) {
  // handle
}, function(err) {
  // done or error
});

mongojs

db.collection.find(query).forEach(function(err, doc) {
  // handle
});

monk

collection.find(query, { stream: true })
  .each(function(doc){
    // handle doc
  })
  .error(function(err){
    // handle error
  })
  .success(function(){
    // final callback
  });

mongoose

collection.find(query).stream()
  .on('data', function(doc){
    // handle doc
  })
  .on('error', function(err){
    // handle error
  })
  .on('end', function(){
    // final callback
  });

Updating documents inside of .forEach callback

The only problem with updating documents inside the .forEach callback is that you have no idea when all the documents have been updated.

To solve this problem you should use some asynchronous control flow solution. Here are some options:

  • async
  • promises (when.js, bluebird)

Here is an example using async and its queue feature:

var q = async.queue(function (doc, callback) {
  // code for your update
  collection.update({
    _id: doc._id
  }, {
    $set: {hi: 'there'}
  }, {
    w: 1
  }, callback);
}, Infinity);

var cursor = collection.find(query);
cursor.each(function(err, doc) {
  if (err) throw err;
  if (doc) q.push(doc); // dispatching doc to async.queue
});

q.drain = function() {
  if (cursor.isClosed()) {
    console.log('all items have been processed');
    db.close();
  }
}
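The promise option from the list above works the same way; here is a minimal sketch using Bluebird (native Promises work too). The collection, query, and update values are the same assumptions as in the async example, and it relies on node-mongodb-native 2.x returning a promise from update() when no callback is passed.

var Promise = require('bluebird');

var updates = [];
var cursor = collection.find(query);

cursor.each(function (err, doc) {
  if (err) throw err;

  if (doc) {
    // start one update per document and remember its promise
    updates.push(collection.update({ _id: doc._id }, { $set: {hi: 'there'} }));
  } else {
    // cursor exhausted: wait for every pending update, then close
    Promise.all(updates)
      .then(function () {
        console.log('all items have been processed');
        db.close();
      })
      .catch(console.error);
  }
});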


Answer 2:


var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {

  assert.equal(err, null);
  console.log("Successfully connected to MongoDB.");

  var query = {
    "category_code": "biotech"
  };

  db.collection('companies').find(query).toArray(function(err, docs) {

    assert.equal(err, null);
    assert.notEqual(docs.length, 0);

    docs.forEach(function(doc) {
      console.log(doc.name + " is a " + doc.category_code + " company.");
    });

    db.close();

  });

});

Notice that the call to .toArray makes the application fetch the entire dataset.


var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {

  assert.equal(err, null);
  console.log("Successfully connected to MongoDB.");

  var query = {"category_code": "biotech"};

  var cursor = db.collection('companies').find(query);

  cursor.forEach(
    function(doc) {
      console.log(doc.name + " is a " + doc.category_code + " company.");
    },
    function(err) {
      assert.equal(err, null);
      return db.close();
    }
  );
});

Notice that the cursor returned by find() is assigned to var cursor. With this approach, instead of fetching all the data into memory and consuming it at once, we stream the data to our application. find() can create a cursor immediately because it doesn't actually make a request to the database until we try to use some of the documents it will provide; the point of the cursor is to describe our query. The second parameter to cursor.forEach shows what to do when the cursor is exhausted or an error occurs.

In the initial version of the code above, it was toArray() that forced the database call: it meant we needed ALL the documents and wanted them in an array.

Also, MongoDB returns data in batches: as the application consumes documents, the cursor issues requests to MongoDB for the next batch.

forEach is better than toArray because we can process documents as they come in, all the way to the end. Contrast this with toArray, where we wait for ALL the documents to be retrieved and the entire array to be built, so we get no advantage from the fact that the driver and the database system work together to batch results to the application. Batching is meant to reduce memory overhead and execution time, so take advantage of it in your application if you can.
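If you want to tune how large those batches are, the cursor exposes a batchSize() method. A minimal sketch reusing the query above; the value 100 is only illustrative:

var cursor = db.collection('companies')
  .find({"category_code": "biotech"})
  .batchSize(100); // ask the driver to fetch up to 100 documents per round trip

cursor.forEach(
  function(doc) {
    console.log(doc.name + " is a " + doc.category_code + " company.");
  },
  function(err) {
    assert.equal(err, null);
    db.close();
  }
);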



Answer 3:

Leonid's answer is great, but I want to reinforce the importance of using async/promises and to give a different solution with a promises example.

The simplest solution to this problem is to loop over each document with forEach and call an update. Usually you don't need to close the db connection after each request, but if you do need to close it, be careful: only close it once you are sure that all updates have finished executing.

A common mistake here is to call db.close() after all updates are dispatched, without knowing whether they have completed. If you do that, you'll get errors.

Wrong implementation:

collection.find(query).each(function(err, doc) {
  if (err) throw err;

  if (doc) {
    collection.update(query, update, function(err, updated) {
      // handle
    });
  } 
  else {
    db.close(); // if there is any pending update, it will throw an error there
  }
});

However, since db.close() is also an async operation (its signature has a callback option), you may get lucky and this code may finish without errors. It will probably only work when you need to update just a few docs in a small collection, so don't rely on it.


Correct solution:

Since a solution with async was already proposed by Leonid, below is a solution using Q promises.

var Q = require('q');
var client = require('mongodb').MongoClient;

var url = 'mongodb://localhost:27017/test';

client.connect(url, function(err, db) {
  if (err) throw err;

  var promises = [];
  var query = {}; // select all docs
  var collection = db.collection('demo');
  var cursor = collection.find(query);

  // read all docs
  cursor.each(function(err, doc) {
    if (err) throw err;

    if (doc) {

      // create a promise to update the doc
      var query = doc;
      var update = { $set: {hi: 'there'} };

      var promise = 
        Q.npost(collection, 'update', [query, update])
        .then(function(updated){ 
          console.log('Updated: ' + updated); 
        });

      promises.push(promise);
    } else {

      // close the connection after executing all promises
      Q.all(promises)
      .then(function() {
        if (cursor.isClosed()) {
          console.log('all items have been processed');
          db.close();
        }
      })
      .fail(console.error);
    }
  });
});


Answer 4:

And here is an example of consuming a Mongoose cursor asynchronously with promises:

new Promise(function (resolve, reject) {
  collection.find(query).cursor()
    .on('data', function(doc) {
      // ...
    })
    .on('error', reject)
    .on('end', resolve);
})
.then(function () {
  // ...
});
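Recent Mongoose versions also expose eachAsync() on query cursors, which returns a promise that resolves once the last document has been handled. A minimal sketch under that assumption, reusing the same collection and query as above:

collection.find(query).cursor()
  .eachAsync(function (doc) {
    // handle doc; returning a promise here makes eachAsync wait for it
  })
  .then(function () {
    // all documents have been processed
  })
  .catch(function (err) {
    // handle error
  });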

Reference:

  • Mongoose cursors
  • Streams and promises


Answer 5:

node-mongodb-native now supports an endCallback parameter to cursor.forEach, which is invoked AFTER the whole iteration; refer to the official documentation for details: http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html#forEach.
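For example (the collection name and query below are placeholders):

var cursor = db.collection('things').find({});

cursor.forEach(
  function (doc) {
    // iterator: called once per document
    console.log(doc._id);
  },
  function (err) {
    // endCallback: called once, after the whole iteration (err is null on success)
    if (err) console.error(err);
    db.close();
  }
);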

Also note that .each is now deprecated in the Node.js native driver.



Answer 6:

Using the mongodb driver and modern Node.js with async/await, a good solution is to use next():

const collection = db.collection('things')
const cursor = collection.find({
  bla: 42 // find all things where bla is 42
});
let document;
while ((document = await cursor.next())) {
  await collection.findOneAndUpdate({
    _id: document._id
  }, {
    $set: {
      blu: 43
    }
  });
}

This results in only one document at a time being held in memory, as opposed to e.g. the accepted answer, where many documents get pulled into memory before processing starts. In the case of "huge collections" (as per the question) this may be important.

If documents are large, this can be improved further by using a projection, so that only the fields you actually need are fetched from the database.
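For example, a sketch of the same loop with a projection that fetches only _id (the option syntax below is for the 3.x+ driver; the 2.x driver takes the field list directly as the second argument to find()):

const cursor = collection.find(
  { bla: 42 },
  { projection: { _id: 1 } } // fetch only the _id field of each matching document
);

let document;
while ((document = await cursor.next())) {
  await collection.findOneAndUpdate(
    { _id: document._id },
    { $set: { blu: 43 } }
  );
}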