What does it take to be durable on Linux?

Posted 2020-08-25 05:30

Question:

I'm writing some software to deal with pretty critical data, and need to know what exactly I need to do to achieve durability.

Everywhere I look is contradictory information, so I'd appreciate any insight.

There are three ways I write to disk.

  • Using O_DIRECT | O_DSYNC, and pread'ing and then pwrite'ing 512 byte - 16 MB blocks.

  • Using O_DIRECT, pread'ing and then pwrite'ing 512 byte blocks, and calling fdatasync as regularly as necessary.

  • Using a memory-mapped file, on which I call msync(..., MS_SYNC | MS_INVALIDATE) as regularly as necessary.

And this is all on ext4 with default flags.

For all of these, is it possible for data to be lost (after the write or sync has returned) or corrupted by a power failure, panic, crash, or anything else?

Is it possible that if my server dies mid pwrite, or between the beginning of pwrite and the end of fdatasync, or between the mapped memory being altered and msync, I'll have a mix of old and new data, or will it be one or the other? I want my individual pwrite calls to be atomic and ordered. Is this the case? And is it the case if they're across multiple files? So if I write with O_DIRECT | O_DSYNC to A, then O_DIRECT | O_DSYNC to B, am I guaranteed that, no matter what happens, if the data is in B it's also in A?

Does fsync even guarantee that the data's written? This says not, but I don't know if things have changed since then.

Does the journalling of ext4 completely solve the issue of corrupt blocks that this SO answer says exist?

I'm currently growing files by calling posix_fallocate and then ftruncate. Are both of these necessary, and are they enough? I figured that ftruncate would actually initialise the allocated blocks to avoid these issues.

To add confusion to the mix, I'm running this on EC2, I don't know if that affects anything. Although it makes it very hard to test as I can't control how aggressively it gets shut down.

Answer 1:

For all of these, is it possible for data to be lost (after the write or sync has returned) or corrupted by a power failure, panic, crash, or anything else?

Absolutely.

Does fsync even guarantee that the data's written? This says not, but I don't know if things have changed since then.

No. The answer is device dependent and likely filesystem dependent. Unfortunately, that filesystem could be layers and layers above the "actual" storage device (e.g. md, lvm, fuse, loop, ib_srp).

Although it makes it very hard to test as I can't control how aggressively it gets shut down.

That's true. But you can probably still use an NMI or sysrq-trigger to create a pretty abrupt halt.
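As a rough sketch of the sysrq-trigger route: on Linux, writing a single command character to /proc/sysrq-trigger as root invokes the corresponding SysRq action ('c' forces a crash, 'b' reboots immediately without syncing or unmounting). The helper name send_sysrq and its path parameter are hypothetical; the path is a parameter only so the function can be exercised against an ordinary file first.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write one SysRq command character to trigger_path.
 * On a real machine trigger_path would be "/proc/sysrq-trigger" (root only).
 * Returns 0 on success, -1 on failure with errno set. */
int send_sysrq(const char *trigger_path, char command)
{
    int fd = open(trigger_path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, &command, 1);
    int saved_errno = errno;

    /* With 'c' or 'b' on the real trigger file, execution is not
     * expected to get this far. */
    close(fd);
    errno = saved_errno;
    return written == 1 ? 0 : -1;
}
```

Calling send_sysrq("/proc/sysrq-trigger", 'c') right after a write or sync returns, then checking the files after reboot, is one way to approximate an abrupt halt. It does not simulate loss of power to the disk's own cache.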



Answer 2:

(2018, many years after this question was first asked)

What does it take to be durable on Linux?

From reading your question I see you have a filesystem between you and the disk. So the question becomes:

What does it take to be durable using a Linux filesystem?

The best you can do (in the general filesystem and unspecified hardware case) is the "fsync dance" which goes something like this:

    preallocate_file(tmp);
    fsync(tmp);
    fsync(dir);
    rename(tmp, normal);
    fsync(normal);
    fsync(dir);

(Shamelessly stolen from the comment that Andres Freund, a Postgres developer, left on LWN.) You must check the return code of every call before proceeding, and assume something went wrong if any call returns non-zero. If you're using mmap, then msync(MS_SYNC) is the equivalent of fsync.
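A minimal C sketch of the dance, checking every return code, might look like the following. The names durable_replace, fsync_path and write_all are hypothetical, and writing the full contents stands in for the preallocate step; treat it as an illustration under those assumptions rather than a vetted implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open path read-only, fsync it, close it. Also works for directories. */
static int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}

/* Write the whole buffer, coping with short writes and EINTR. */
static int write_all(int fd, const void *data, size_t len)
{
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace final_path (inside dir) with new contents via tmp_path, which must
 * be in the same directory. Returns 0 only if every step succeeded; on
 * failure tmp_path may be left behind for the caller to clean up. */
int durable_replace(const char *dir, const char *tmp_path,
                    const char *final_path, const void *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write_all(fd, data, len) != 0 || fsync(fd) != 0) {  /* fsync(tmp) */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    if (fsync_path(dir) != 0)                 /* fsync(dir) */
        return -1;
    if (rename(tmp_path, final_path) != 0)    /* rename(tmp, normal) */
        return -1;
    if (fsync_path(final_path) != 0)          /* fsync(normal) */
        return -1;
    return fsync_path(dir);                   /* fsync(dir) */
}
```

A call such as durable_replace("/data", "/data/state.tmp", "/data/state", buf, len) (paths made up for illustration) either returns 0 or leaves the caller knowing that some step failed and that the new contents cannot be assumed durable.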

A similar pattern to the above is mentioned on Dan Luu's "Files are hard" (which has a nice table about overwrite atomicity of various filesystems), the LWN article "Ensuring data reaches disk" and Ted Ts'o's "Don’t fear the fsync!".

For all of these [O_DIRECT | O_DSYNC, O_DIRECT + fdatasync, mmap + msync], is it possible for data to be lost (after the write or sync has returned) or corrupted by a power failure, panic, crash, or anything else?

Yes, you could have unnoticed corruption, because "allocating writes" that grow the file past its current bounds can cause metadata operations, and you are checking only for data durability, not metadata durability.
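One way to keep later writes from being allocating writes is to grow the file up front and flush the resulting metadata before relying on the new space. The sketch below is only an illustration, and grow_file_durably is a hypothetical name. It relies on two documented behaviours: posix_fallocate extends the file size when offset+len exceeds it, and it returns the error number directly instead of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Ensure fd has allocated blocks for the first new_size bytes (extending the
 * file if it is shorter), then flush data and metadata.
 * Returns 0 on success, -1 on failure with errno set. */
int grow_file_durably(int fd, off_t new_size)
{
    /* posix_fallocate reports failure via its return value, not errno. */
    int err = posix_fallocate(fd, 0, new_size);
    if (err != 0) {
        errno = err;
        return -1;
    }

    /* fsync rather than fdatasync: flush all of the file's metadata, not
     * only what is needed to read the data back. */
    if (fsync(fd) != 0)
        return -1;

    return 0;
}
```

If the file was newly created, its directory entry still needs an fsync on the containing directory, as in the fsync dance.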

if my server dies mid pwrite, or between the beginning of pwrite and the end of fdatasync, or between the mapped memory being altered and msync, I'll have a mix of old and new data, [etc.]

As the state of the data is undefined in the case of interrupted overwrites it could be anything...

I want my individual pwrite calls to be atomic and ordered. Is this the case?

Between fsyncs, reordering could occur (e.g. if O_DIRECT silently fell back to buffering).

case if they're across multiple files?

You're in even more trouble. To cover this you would need to be writing your own journal and probably using file renames.

if I write with O_DIRECT | O_DSYNC to A, then O_DIRECT | O_DSYNC to B,

No.

Does fsync even guarantee that the data's written?

Yes. It is necessary (if not sufficient) for the above, given a modern Linux, a truthful disk stack and no bugs.

Does the journalling of ext4 completely solve the issue of corrupt blocks

No.

(ETOOMANYQUESTIONS)

Yes, the Linux software stack could be buggy (2019: see the addendum below), or the hardware could be buggy (or lie in a way it can't back up), but that doesn't stop the above being the best you can do if everything lives up to its end of the bargain on a POSIX filesystem. If you know you have a particular OS with a particular filesystem (or no filesystem) and a particular hardware setup, then you may be able to drop some of the above, but in general you should not skip any step.

Bonus answer: O_DIRECT alone cannot guarantee durability when used with filesystems (an initial issue would be "how do you know metadata has been persisted?"). See "Clarifying Direct IO's Semantics" in the Ext4 wiki for discussion on this point.

Addendum (March 2019)

Even with the current Linux kernel (5.0 at the time of writing), fsync doesn't always see error notifications, and kernels before 4.16 were even worse. The PostgreSQL folks found that error notifications can be lost and unwritten pages marked as clean, leading to a case where fsync returns success even though there was a (swallowed) error asynchronously writing back the data. Most Linux filesystems don't reliably keep dirty data around once a failure has happened, so repeatedly "retrying" a failed fsync doesn't necessarily indicate what you might expect. See the PostgreSQL "Fsync Errors" wiki page, the LWN article "PostgreSQL's fsync() surprise", and the FOSDEM 2019 talk "How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it" for details.
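In code, the practical consequence is that a failed fsync should be treated as final rather than retried. A small sketch of that policy follows; fsync_once is a hypothetical name, and what the caller does on failure (rewrite from its own copy of the data, or stop) is left as an assumption.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Call fsync exactly once. On failure, log and return -1 with errno set.
 * The caller should NOT simply call fsync again: a later success would not
 * prove that the earlier dirty data reached the disk. */
int fsync_once(int fd, const char *what)
{
    if (fsync(fd) == 0)
        return 0;

    int saved = errno;
    fprintf(stderr,
            "fsync(%s) failed: %s; treat everything written since the "
            "last successful fsync as possibly lost\n",
            what, strerror(saved));
    errno = saved;
    return -1;
}
```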

So the post-credits conclusion is that it's complicated:

  • The fsync dance is necessary (even if it's not always sufficient) to at least cover the non-buggy I/O stack case
  • If you do your (write) I/O via direct I/O you will be able to get accurate errors when a write goes wrong
  • Earlier kernels (older than 4.16) were buggy when it came to reporting errors via fsync
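To illustrate the direct I/O point, here is a hedged sketch of a single aligned write with O_DIRECT | O_DSYNC that checks the result of pwrite itself. The function name direct_write_block and the 4096-byte alignment are assumptions; the alignment actually required depends on the device and filesystem, and some filesystems reject O_DIRECT outright with EINVAL.

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096             /* assumed alignment and block size */

/* Write one BLOCK-sized block filled with `fill` at block index block_no.
 * Returns 0 on success, -1 on failure with errno set. */
int direct_write_block(const char *path, off_t block_no, unsigned char fill)
{
    void *buf = NULL;
    int err = posix_memalign(&buf, BLOCK, BLOCK);  /* O_DIRECT needs aligned memory */
    if (err != 0) {
        errno = err;
        return -1;
    }
    memset(buf, fill, BLOCK);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) {
        int saved = errno;
        free(buf);
        errno = saved;
        return -1;
    }

    /* The error (if any) is reported by this call, not by a later fsync. */
    ssize_t n = pwrite(fd, buf, BLOCK, block_no * BLOCK);
    int saved = 0;
    if (n < 0)
        saved = errno;
    else if (n != BLOCK)
        saved = EIO;           /* treat a short direct write as an error */

    free(buf);
    if (close(fd) != 0 && saved == 0)
        saved = errno;
    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return 0;
}
```

This covers only the data of a non-allocating overwrite. Metadata changes still need fsync on the file and its directory.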

Also see:

  • Writing programs to cope with I/O errors causing lost writes on Linux
  • The explanation and links within https://github.com/commercialhaskell/rio/issues/87