Insert performance of node-mongodb-native


Question:

I'm testing performance of Node.js with MongoDB. I know each of these is fine independent of the other, but I'm trying a handful of tests to get a feel for them. I ran across this issue and I'm having trouble determining the source.

The Problem

I'm trying to insert 1,000,000 records from a single Node.js program. It absolutely crawls: we're talking a 20-minute execution time. This happens on both my Mac and on CentOS, although the behavior differs slightly between the two. It does eventually complete.

The effect is similar to swapping, although it's not (memory never exceeds 2 GB). There are only 3 connections open to MongoDB, and most of the time there's no data being inserted. It appears to be doing a lot of context switching, and the Node.js CPU core is maxed out.

The effect is similar to the one mentioned in this thread.

I tried the same thing using PHP and it finished in 2-3 minutes. No drama.

Why?

Possible Causes

I currently believe this is either a Node.js socket issue, something going on with libev behind the scenes, or some other node-mongodb-native issue. I may be totally wrong, so I'm looking for a little guidance here.

As for other Node.js MongoDB adapters, I have tried Mongolian, and it appears to queue documents in order to batch insert them; it ends up running out of memory. So that's out. (Side note: I have no idea why that happens either, since it doesn't come close to my box's 16 GB limit, but I haven't bothered investigating it much further.)

I should probably mention that I did in fact test a master/worker cluster with 4 workers (on a quad-core machine) and it finished in 2-3 minutes.
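
For reference, a rough sketch of that kind of split using Node's built-in cluster module is below. It isn't the exact code I ran, just the general shape: the master forks one worker per core, and each worker runs the same naive loop over its share of the documents (the doc fields here are placeholders).

cluster = require "cluster"
mongodb = require "mongodb"

numWorkers = 4          # one worker per CPU core
times      = 1000000
share      = times / numWorkers

if cluster.isMaster
  cluster.fork() for n in [0...numWorkers]
else
  server = new mongodb.Server "127.0.0.1", 27017
  db = mongodb.Db "test", server
  db.open (error, client) ->
    throw error if error?
    collection = mongodb.Collection client, "foo"
    # same naive loop as above, but each worker only inserts its share
    for i in [0...share]
      doc = key: "#{process.pid}-#{i}", foo1: 1000, bar1: 2000, baz1: 3000
      collection.insert doc, {safe: true}, (error, response) ->
        console.log error.message if error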

The Code

Here's my Node.js CoffeeScript program:

mongodb = require "mongodb"
microtime = require "microtime"
crypto = require "crypto"

times = 1000000
server = new mongodb.Server "127.0.0.1", 27017
db = mongodb.Db "test", server
db.open (error, client) ->
  throw error if error?

  collection = mongodb.Collection client, "foo"

  for i in [0...times]
    console.log "Inserting #{i}..." if i % 100000 == 0

    hash = crypto.createHash "sha1"
    hash.update "" + microtime.now() + (Math.random() * 255 | 0)
    key = hash.digest "hex"

    doc =
      key: key,
      foo1: 1000,
      foo2: 1000,
      foo3: 1000,
      bar1: 2000,
      bar2: 2000,
      bar3: 2000,
      baz1: 3000,
      baz2: 3000,
      baz3: 3000

    collection.insert doc, safe: true, (error, response) ->
      console.log error.message if error

And here's the roughly equivalent PHP program:

<?php
$mongo = new Mongo();
$collection = $mongo->test->foo;

$times = 1000000;
for ($i = 0; $i < $times; $i++) {
    if ($i % 100000 == 0) {
        print "Inserting $i...\n";
    }

    $doc = array(
        "key" => sha1(microtime(true) + rand(0, 255)),
        "foo1" => 1000,
        "foo2" => 1000,
        "foo3" => 1000,
        "bar1" => 2000,
        "bar2" => 2000,
        "bar3" => 2000,
        "baz1" => 3000,
        "baz2" => 3000,
        "baz3" => 3000
    );
    try {
        $collection->insert($doc, array("safe" => true));
    } catch (MongoCursorException $e) {
        print $e->getMessage() . "\n";
    }
}

Answer 1:

It sounds like you're running into the default heap limit in V8. I wrote a blog post about removing this limitation.

The garbage collector is probably going crazy and chewing on CPU, since it will constantly execute until you're under the 1.4GB limit.
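
If that's what's happening, one quick thing to try is raising V8's old-space limit when launching node. The value is in megabytes, 4096 below is just an example, and mongotest.js stands in for the compiled script; the coffee CLI can also pass the flag through to node with --nodejs.

# raise V8's old-space limit to roughly 4 GB (value in MB)
node --max-old-space-size=4096 mongotest.js

# or pass it through when running the CoffeeScript directly
coffee --nodejs --max-old-space-size=4096 mongotest.coffee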



Answer 2:

What happens if you explicitly return a value at the end of the db.open callback function? Your generated javascript code is pushing all of your collection.insert returns onto a big "_results" array, which would get slower and slower, I imagine.

db.open(function(error, client) {
  var collection, doc, hash, i, key, _i, _results;
  if (error != null) {
    throw error;
  }
  collection = mongodb.Collection(client, "foo");
  _results = [];
  for (i = _i = 0; 0 <= times ? _i < times : _i > times; i = 0 <= times ? ++_i : --_i) {
    ...
    _results.push(collection.insert(doc, {
      safe: true
    }, function(error, response) {
      if (error) {
        return console.log(error.message);
      }
    }));
  }
  return _results;
});

Try adding this at the end of your coffeescript:

    collection.insert doc, safe: true, (error, response) ->
      console.log error.message if error

  return

*Update:* So, I actually tried to run your program, and noticed a few more issues:

The biggest problem is that you're firing off a million inserts in one synchronous loop without waiting for any of them to complete, which will really kill your RAM and eventually stop the inserts (at least, it did for me). I killed it at around 800 MB of RAM.

You need to change the way you're calling collection.insert() so that each insert is kicked off from the previous one's callback instead of all at once in a loop.

I rewrote it like so, breaking out a couple of functions for clarity:

mongodb = require "mongodb"
microtime = require "microtime"
crypto = require "crypto"

gen  = () ->
  hash = crypto.createHash "sha1"
  hash.update "" + microtime.now() + (Math.random() * 255 | 0)
  key = hash.digest "hex"

  key: key,
  foo1: 1000,
  foo2: 1000,
  foo3: 1000,
  bar1: 2000,
  bar2: 2000,
  bar3: 2000,
  baz1: 3000,
  baz2: 3000,
  baz3: 3000

times = 1000000
i = times

insertDocs = (collection) ->
  collection.insert gen(), {safe:true}, () ->
    console.log "Inserting #{times-i}..." if i % 100000 == 0
    if --i > 0
      insertDocs(collection)
    else
      process.exit 0
  return

server = new mongodb.Server "127.0.0.1", 27017
db = mongodb.Db "test", server
db.open (error, db) ->
  throw error if error?
  db.collection "foo", (err, collection) ->
    insertDocs(collection)
    return
  return

Which finished in ~3 minutes:

wfreeman$ time coffee mongotest.coffee
Inserting 0...
Inserting 100000...
Inserting 200000...
Inserting 300000...
Inserting 400000...
Inserting 500000...
Inserting 600000...
Inserting 700000...
Inserting 800000...
Inserting 900000...

real    3m31.991s
user    1m55.211s
sys 0m23.420s

Also, it has the side benefit of using less than 100 MB of RAM, 70% CPU on node, and 40% CPU on mongod (on a 2-core box, so it looks like it wasn't maxing out the CPU).