How can I improve MongoDB bulk performance?

Posted 2019-02-25 08:07

Question:

I have an object with some metadata and a big array of items. I used to store it in MongoDB as a single document and query it by $unwinding the array. However, in extreme cases, the array becomes so big that I run into the 16MB BSON document size limit.

So I need to store each element of the array as a separate document. For that I need to add the metadata to all of them, so I can find them again as a group. It was suggested that I use bulk operations for this.

However, performance seems to be really slow. Inserting one big document was near-instant, and this takes up to ten seconds.

var bulk        = col.initializeOrderedBulkOp();
var metaData    = {
    hash            : hash,
    date            : timestamp,
    name            : name
};

// measure time here

for (var i = 0, l = array.length; i < l; i++) { // 6000 items
    var item = array[i];

    bulk.insert({ // Apparently, this 6000 times takes 2.9 seconds
        data        : item,
        metaData    : metaData
    });

}

bulk.execute(bulkOpts, function(err, result) { // and this takes 6.5 seconds
    // measure time here
});

For a bulk insert of 6000 documents totalling 38 MB of data (which translates to 49 MB as BSON in MongoDB), this performance seems unacceptably bad. The overhead of appending metadata to every document can't be that bad, right? The overhead of updating two indexes can't be that bad, right?

Am I missing something? Is there a better way of inserting groups of documents that need to be fetched as a group?

It's not just my laptop; it's the same on the server. That makes me think this is a programming error rather than a configuration error.

Using MongoDB 2.6.11 with node adapter node-mongodb-native 2.0.49

-update-

Just the act of adding the metadata to every element in the bulk accounts for 2.9 seconds. There needs to be a better way of doing this.

Answer 1:

Send the bulk insert operations to the server in batches. Instead of sending every insert as an individual statement, this breaks the work up into manageable chunks for the server to commit, which results in less traffic to the server and more efficient use of the wire. There is also less time spent waiting for the response in the callback with this approach.

A much better approach would be to use the async module, so that even looping over the input list is a non-blocking operation. The batch size can vary, but executing the batch every 1000 entries should keep each request safely under the 16MB BSON hard limit, since the whole "request" is equal to one BSON document.
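
As a rough illustration of the batching idea, independent of the driver, the sketch below splits the input into groups of at most 1000 documents. The helper names `toDocuments` and `chunk` are hypothetical, the sample values are made up, and `metaData` is shared by reference rather than rebuilt for every item.

```javascript
// Hypothetical helper: wrap every array item in a document that carries
// the shared metadata (the same object is referenced, not copied).
function toDocuments(items, metaData) {
    return items.map(function (item) {
        return { data: item, metaData: metaData };
    });
}

// Hypothetical helper: split a list into consecutive batches of at most
// `size` entries each.
function chunk(list, size) {
    var batches = [];
    for (var i = 0; i < list.length; i += size) {
        batches.push(list.slice(i, i + size));
    }
    return batches;
}

// Made-up sample input standing in for the 6000-item array in the question.
var sampleItems = [];
for (var n = 0; n < 6000; n++) {
    sampleItems.push({ value: n });
}

var sampleMeta = { hash: "abc123", date: 1456387200000, name: "example" };
var batches = chunk(toDocuments(sampleItems, sampleMeta), 1000);

console.log(batches.length);    // 6
console.log(batches[0].length); // 1000
```

Whether 1000 is a suitable size depends on how large the individual documents are; with the roughly 49 MB over 6000 documents mentioned in the question, each batch would be around 8 MB.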

The following demonstrates using the async module's whilst to iterate through the array: it repeatedly calls the iterator function while the test function returns true, and calls the final callback when the loop stops or when an error occurs.

var bulk = col.initializeOrderedBulkOp(),
    counter = 0,
    len = array.length,
    buildModel = function(index){   
        return {
            "data": array[index],
            "metaData": {
                "hash": hash,
                "date": timestamp,
                "name": name
            }
        }
    };

async.whilst(
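    // Test: keep looping while items remain
    function() { return counter < len; },

    // Do this in the iterator
    function (callback) {
        // Build the model from the current index before incrementing,
        // so array[0] is included and array[len] is never read
        var model = buildModel(counter);
        counter++;
        bulk.insert(model);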

        if (counter % 1000 == 0) {
            bulk.execute(function(err, result) {
                bulk = col.initializeOrderedBulkOp();
                callback(err);
            });
        } else {
            callback();
        }
    },

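    // When all is done
    function(err) {
        if (err) { return console.error(err); }

        if (counter % 1000 != 0) {
            // Flush the remaining inserts that did not fill a whole batch
            bulk.execute(function(err, result) {
                if (err) { return console.error(err); }
                console.log("All done now!");
            });
        } else {
            console.log("All done now!");
        }
    }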
);
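
For comparison, the same "one batch at a time" flow can be sketched without the async module. This is only a minimal illustration under stated assumptions: `insertInBatches` and `fakeInsertBatch` are hypothetical names, and the database call is replaced by a stand-in function so the example runs without a server.

```javascript
// Hypothetical helper: send `docs` in consecutive batches of `batchSize`,
// one batch at a time. `insertBatch(batch, cb)` is supplied by the caller
// and is an assumption of this sketch, not a driver API.
function insertInBatches(docs, batchSize, insertBatch, done) {
    var offset = 0;
    var inserted = 0;

    function next(err) {
        if (err) { return done(err, inserted); }
        if (offset >= docs.length) { return done(null, inserted); }

        var batch = docs.slice(offset, offset + batchSize);
        offset += batch.length;

        insertBatch(batch, function (batchErr) {
            if (!batchErr) { inserted += batch.length; }
            next(batchErr);
        });
    }

    next(null);
}

// Stand-in for the database: records each batch size and calls back
// asynchronously, the way a network round trip would.
var seenBatchSizes = [];
function fakeInsertBatch(batch, cb) {
    seenBatchSizes.push(batch.length);
    setImmediate(cb, null);
}

// Made-up sample documents.
var docs = [];
for (var i = 0; i < 2500; i++) {
    docs.push({ data: i, metaData: { name: "example" } });
}

insertInBatches(docs, 1000, fakeInsertBatch, function (err, count) {
    if (err) { return console.error(err); }
    // prints "Inserted 2500 documents in 3 batches"
    console.log("Inserted " + count + " documents in " + seenBatchSizes.length + " batches");
});
```

With the real driver, `insertBatch` could wrap a call such as `col.insertMany(batch, cb)`, assuming a driver version that provides `insertMany`; the next batch is only built and sent after the previous one has been acknowledged.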