Does MongoDB journaling guarantee durability?

Posted 2019-01-24 06:05

Question:

Even if journaling is on, is there still a chance to lose writes in MongoDB?

"By default, the greatest extent of lost writes, i.e., those not made to the journal, are those made in the last 100 milliseconds."

This is from Manage Journaling, which indicates you could lose writes made since the last time the journal was flushed to disk.

If I want more durability, "To force mongod to commit to the journal more frequently, you can specify j:true. When a write operation with j:true is pending, mongod will reduce journalCommitInterval to a third of the set value."

Even in this case, it looks like flushing the journal to disk is asynchronous so there is still a chance to lose writes. Am I missing something about how to guarantee that writes are not lost?

Answer 1:

Posting a new answer to clean this up. I ran tests and read the source code again, and I'm sure the confusion comes from an unfortunate sentence in the write concern documentation. With journaling enabled and the j:true write concern, the write is durable, and there is no mysterious window for data loss.

Even if journaling is on, is there still a chance to lose writes in MongoDB?

Yes, because durability also depends on the individual operation's write concern.

"By default, the greatest extent of lost writes, i.e., those not made to the journal, are those made in the last 100 milliseconds."

This is from Manage Journaling, which indicates you could lose writes made since the last time the journal was flushed to disk.

That is correct. The journal is flushed by a separate thread asynchronously, so you can lose everything since the last flush.

If I want more durability, "To force mongod to commit to the journal more frequently, you can specify j:true. When a write operation with j:true is pending, mongod will reduce journalCommitInterval to a third of the set value."

This confused me, too. Here's what it means:

When you send a write operation with j:true, it doesn't trigger the disk flush immediately, and the flush doesn't happen on the network thread. That makes sense, because there could be dozens of applications talking to the same mongod instance. If every j:true write from every application triggered its own flush, the database would be very slow because it would be fsyncing all the time.

Instead, what happens is that the 'durability thread' will take all pending journal commits and flush them to disk. The thread is implemented like this (comments mine):

sleepmillis(oneThird); //dur.cpp, line 801
for( unsigned i = 1; i <= 2; i++ ) {
  // break, if any j:true write is pending
  if( commitJob._notify.nWaiting() )
    break;
  // or the number of bytes is greater than some threshold
  if( commitJob.bytes() > UncommittedBytesLimit / 2  )
    break;
  // otherwise, sleep another third
  sleepmillis(oneThird);
}

// fsync all pending writes
durThreadGroupCommit();

So a pending j:true operation will cause the journal commit thread to commit earlier than it normally would, and it will commit all pending writes to the journal, including those that don't have j:true set.
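To make the timing concrete, here is a minimal sketch that models the wait loop above as a pure function. It is not MongoDB code: msUntilGroupCommit and commitRequestedAt are hypothetical names, commitRequestedAt stands in for both break conditions (a waiting j:true write, or too many uncommitted bytes), and oneThird is assumed to be plain integer division of the commit interval.

```cpp
#include <cassert>
#include <functional>

// Hypothetical model of the durability thread's wait loop (not MongoDB code).
// commitRequestedAt(t) says whether, t ms into the cycle, a j:true write is
// waiting or the uncommitted-bytes threshold has been exceeded.
// Returns the number of milliseconds slept before the group commit runs.
int msUntilGroupCommit(int commitIntervalMs,
                       const std::function<bool(int)>& commitRequestedAt) {
    assert(commitIntervalMs >= 3);              // keeps oneThird non-zero in this model
    const int oneThird = commitIntervalMs / 3;  // assumption: integer division
    int elapsed = oneThird;                     // first, unconditional sleep
    for (unsigned i = 1; i <= 2; i++) {
        if (commitRequestedAt(elapsed))
            break;                              // commit early
        elapsed += oneThird;                    // otherwise sleep another third
    }
    return elapsed;
}
```

Under these assumptions and a 100ms interval, the function returns 99 when nothing is pending, 33 when a j:true write is already waiting, and 66 when one arrives 40ms into the cycle.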

Even in this case, it looks like flushing the journal to disk is asynchronous so there is still a chance to lose writes. Am I missing something about how to guarantee that writes are not lost?

The write (or the getLastError command) with a j:true journaled write concern will wait for the durability thread to finish syncing, so there's no risk of data loss (as far as the OS and hardware guarantee that).

The sentence "However, there is a window between journal commits when the write operation is not fully durable" probably refers to a mongod running with journaling enabled that accepts a write that does NOT use the j:true write concern. In that case, the write can be lost if the server goes down before the next journal commit.
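The difference between the two cases can be sketched with a toy model. This is not MongoDB code and it has no threads: ToyJournal and its members are hypothetical names, and groupCommit() merely stands in for the durability thread's group commit. The only point is that a j:true write is acknowledged after the commit that covers it, while a write without j:true is acknowledged before.

```cpp
#include <cstdint>

// Toy model (not MongoDB code): every write gets a sequence number, and a
// group commit makes everything written so far durable.
class ToyJournal {
public:
    // Buffers a write in memory and returns its sequence number.
    std::uint64_t write() { return ++lastWritten_; }

    // Stands in for the group commit: all buffered writes reach the journal.
    void groupCommit() { lastCommitted_ = lastWritten_; }

    // Without j:true, the write is acknowledged as soon as it is buffered.
    bool ackedWithoutJ(std::uint64_t seq) const { return seq <= lastWritten_; }

    // With j:true, the write is acknowledged only once a commit covers it.
    bool ackedWithJ(std::uint64_t seq) const { return seq <= lastCommitted_; }

    // After a crash, only committed writes can be recovered.
    bool survivesCrash(std::uint64_t seq) const { return seq <= lastCommitted_; }

private:
    std::uint64_t lastWritten_ = 0;
    std::uint64_t lastCommitted_ = 0;
};
```

In this model, a write that has been acknowledged without j:true may still fail survivesCrash() until the next groupCommit(), whereas ackedWithJ() can never be true for a write that would not survive.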

I filed a docs bug report for this.



Answer 2:

Maybe. Yes, it waits for the data to be written, but according to the docs 'there is a window between journal commits when the write operation is not fully durable', whatever that means. I couldn't find out what they are referring to.

I'm leaving the edited answer here, but I went back and forth on this, so it's a bit confusing:


This is a bit tricky, because there are a lot of levers you can pull:

Your MongoDB setup

Assuming that journaling is enabled (the default on 64-bit builds), the journal is committed at regular intervals. The default value of journalCommitInterval is 100ms if the journal and the data files are on the same block device, or 30ms if they aren't (so it's preferable to have the journal on a separate disk).

You can also lower journalCommitInterval to as little as 2ms, but that increases the number of write operations and reduces overall write performance.
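As a small illustration of these numbers, here is a sketch with two hypothetical helpers (neither is part of MongoDB). The 100ms and 30ms defaults and the 2ms lower bound come from the text above; the 300ms upper bound is an assumption taken from another answer in this thread.

```cpp
#include <stdexcept>

// Hypothetical helper: the default commit interval described above.
// 100 ms when journal and data files share a block device, 30 ms otherwise.
int defaultCommitIntervalMs(bool journalSharesDeviceWithData) {
    return journalSharesDeviceWithData ? 100 : 30;
}

// Hypothetical helper: rejects intervals outside 2..300 ms.
// The upper bound of 300 ms is an assumption, not taken from the docs.
int checkedCommitIntervalMs(int requestedMs) {
    if (requestedMs < 2 || requestedMs > 300)
        throw std::out_of_range("commit interval outside 2..300 ms");
    return requestedMs;
}
```

A write that is acknowledged without waiting for the journal can sit unjournaled for roughly one such interval, which is why a smaller value narrows the loss window at the cost of more frequent disk writes.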

The Write Concern

You need to specify a write concern that tells the driver and the database to wait until the data is written to disk. However, this won't wait until the data has actually been written to disk, because that could take up to 100ms in the worst case with the default setup.

So, at the very best, there's a 2ms window where data can get lost. That's insufficient for a number of applications, however.

The fsync command forces a disk flush of all data files, but that's unnecessary if you use journaling, and it's inefficient.

Real-Life Durability

Even if you were to journal every write, what good is it if the datacenter administrator has a bad day and takes a chainsaw to your hardware, or the hardware simply disintegrates?

Redundant storage, not at the block device level like RAID but at a much higher level, is a better option for many scenarios: keep the data in different locations, or at least on different machines, using a replica set, and use the w:majority write concern with journaling enabled (journaling will only apply on the primary, though). Use RAID on the individual machines to improve your odds.

This offers the best tradeoff of performance, durability and consistency. Also, it allows you to adjust the write concern for every write and has good availability. If the data is queued for the next fsync on three different machines, it might still be 30ms to the next journal commit on any of the machines (worst case), but the chance of three machines going down within the 30ms interval is probably a millionfold lower than the chainsaw-massacre-admin scenario.

Evidence

TL;DR: I think my answer above is correct.

The documentation can be a little confusing, especially with regard to wtimeout, so I checked the source. I'm not an expert on the mongo source, so take this with a grain of salt:

In write_concern.cpp, we find (edited for brevity):

if ( cmdObj["j"].trueValue() ) {
    if( !getDur().awaitCommit() ) {
        // --journal is off
        result->append("jnote", "journaling not enabled on this server");
    } // ...
}
else if ( cmdObj["fsync"].trueValue() ) {
    if( !getDur().awaitCommit() ) {
        // if get here, not running with --journal
        log() << "fsync from getlasterror" << endl;
        result->append( "fsyncFiles" , MemoryMappedFile::flushAll( true ) );
    }
}

Note the call to MemoryMappedFile::flushAll( true ) when fsync is set and journaling is off. This call is clearly not in the first branch. In all other cases, durability is handled on a separate thread (the relevant files are prefixed dur_).
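The branching in that excerpt can be summarized as a small decision function. This is a hypothetical restatement, not MongoDB code: GleAction and actionFor are made-up names, and it assumes, as the source comments suggest, that getDur().awaitCommit() returns false exactly when journaling is off.

```cpp
// Hypothetical summary of the excerpt above (not MongoDB code).
enum class GleAction {
    AwaitJournalCommit,  // getDur().awaitCommit() succeeded
    ReportNoJournaling,  // "jnote" is appended to the result
    FlushAllDataFiles,   // MemoryMappedFile::flushAll(true)
    Nothing              // neither j nor fsync was requested
};

// Maps the getLastError options and the server's journaling state to the
// action taken by the excerpt. Note that j takes precedence over fsync.
GleAction actionFor(bool j, bool fsync, bool journalingEnabled) {
    if (j)
        return journalingEnabled ? GleAction::AwaitJournalCommit
                                 : GleAction::ReportNoJournaling;
    if (fsync)
        return journalingEnabled ? GleAction::AwaitJournalCommit
                                 : GleAction::FlushAllDataFiles;
    return GleAction::Nothing;
}
```

Read this way, the full data file flush only happens for fsync on a server without journaling; with journaling on, both options end up waiting for the journal commit.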

That explains what wtimeout is for: it refers to the time spent waiting for secondaries, and has nothing to do with I/O or fsync on the server.



Answer 3:

Journaling is for keeping the data on a particular mongod in a consistent state, even in case of chainsaw madness. However, with client-side settings, namely the write concern, it can also be used to enforce durability. See the write concern docs.

There is an option, j:1, which ensures that the write operation is not acknowledged until it has been written to the journal file on disk (so not just to the memory map). However, another docs page says the opposite. :) I would vote for the first interpretation; it makes me feel more comfortable.

If you run lots of commands with this option, MongoDB will shorten the journal commit interval to speed things up. You can read about this in the docs page you also mentioned, and, as others have already said, you can set the interval to between 2 and 300ms.

In my opinion, durability is better ensured by the w:2 option: if the update/write operation is acknowledged by two members of a replica set, it is really unlikely, though not impossible, to lose both within the same minute (the data file flush interval).

Using both options means that when the operation is acknowledged by the database cluster, it resides in memory on two different boxes, and on one of them it is also in a consistent, recoverable place on disk.



Answer 4:

Generally, lost writes are an issue in every system that has buffering, caching, or delayed writes between the system's runtime and permanent (non-volatile) storage, even at the OS level (for example, write-behind caching). So there is always a chance of losing writes: even if your concrete provider (MongoDB) offers functionality for transaction durability, the underlying OS is ultimately responsible for writing the data, and even then there is caching at the device level. And those are just the lower levels; making the system highly concurrent, distributed, and performant only makes matters worse.

In short, there is no absolute durability, only practical, eventual, hope-for-the-best durability, especially with a NoSQL store like MongoDB, which wasn't primarily designed for consistency and durability in the first place.



Answer 5:

I would have to agree with Sammaye that journaling has little to do with durability. However, if you want to know whether you can really trust MongoDB to store your data with good consistency, I suggest reading this blog post. There is a reply from 10gen to that post, and a reply from the author to the 10gen post. Read all of them to make an educated decision. It took me some time to understand all the details on my own, but the post covers the basics.

The response to the blog post was given here by 10gen, the company that makes MongoDB.

And the response to the response was given by the professor on this post.

It explains a lot about how MongoDB shards data, how it actually functions, and the performance hit it takes if you add extra safety locks. I would say these three pieces are the best and by far the most comprehensive discussion of the benefits and drawbacks of MongoDB. If you think it's one-sided, look at the comments and see what people had to say. If something received a reply from the company that made the software, it must have made at least some good points.



Tags: mongodb