I'm writing some software to deal with pretty critical data, and need to know what exactly I need to do to achieve durability.
Everywhere I look is contradictory information, so I'd appreciate any insight.
There are three ways I write to disk.
Using O_DIRECT | O_DSYNC, and pread'ing and then pwrite'ing 512 byte - 16 MB blocks.
Using O_DIRECT, pread'ing and then pwrite'ing 512 byte blocks, and calling fdatasync as regularly as necessary.
Using a memory-mapped file, on which I call msync(..., MS_SYNC | MS_INVALIDATE) as regularly as necessary.
And this is all on ext4 with default flags.
For all of these, is it possible for data to be lost (after the write or sync has returned) or corrupted by a power failure, panic, crash, or anything else?
Is it possible that if my server dies mid-pwrite, or between the beginning of pwrite and the end of fdatasync, or between the mapped memory being altered and msync, I'll have a mix of old and new data, or will it be one or the other?
I want my individual pwrite calls to be atomic and ordered. Is this the case?
And is it the case if they're across multiple files? So if I write with O_DIRECT | O_DSYNC to A, then O_DIRECT | O_DSYNC to B, am I guaranteed that, no matter what happens, if the data is in B it's also in A?
Does fsync even guarantee that the data's written? This says not, but I don't know if things have changed since then.
Does the journalling of ext4 completely solve the issue of corrupt blocks that this SO answer says exist?
I'm currently growing files by calling posix_fallocate and then ftruncate. Are both of these necessary, and are they enough? I figured that ftruncate would actually initialise the allocated blocks to avoid these issues.
To add confusion to the mix, I'm running this on EC2, and I don't know if that affects anything. It does make it very hard to test, as I can't control how aggressively it gets shut down.
(2018, many years after this question was first asked)
From reading your question I see you have a filesystem between you and the disk. So the question becomes:
The best you can do (in the general filesystem and unspecified hardware case) is the "fsync dance" which goes something like this:
(shamelessly stolen from the comment Andres Freund (Postgres Developer) left on LWN) and you must check the return code of every call before proceeding to see if it succeeded, and assume something went wrong if any call returned non-zero. If you're using mmap then msync(MS_SYNC) is the equivalent of fsync.
A similar pattern to the above is mentioned in Dan Luu's "Files are hard" (which has a nice table about overwrite atomicity of various filesystems), the LWN article "Ensuring data reaches disk" and Ted Ts'o's "Don’t fear the fsync!".
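For illustration, here is a minimal sketch of one common shape of that pattern (write a temporary file, fsync it, fsync the directory, rename, fsync the directory again). It is not a reproduction of the sequence the LWN comment gives. It is written in Python because its os module wraps the same syscalls and raises OSError on any failure, which stands in for checking every return code. The helper names and the ".tmp" suffix are assumptions made up for this example.

```python
import os


def fsync_dir(path):
    """fsync a directory so entries created or renamed in it are persisted."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_replace(target, data):
    """Write data to a temp file, then rename it over target, syncing each step.

    Any failing call raises OSError, so no later step runs after a failure.
    """
    directory = os.path.dirname(os.path.abspath(target))
    tmp = target + ".tmp"  # hypothetical temp-file naming scheme

    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush the temp file's data and metadata
    finally:
        os.close(fd)
    fsync_dir(directory)  # persist the temp file's directory entry

    os.rename(tmp, target)  # atomically replaces target on POSIX filesystems
    fsync_dir(directory)  # persist the rename itself
```

After a crash, a reader of target should then see either the old contents or the new contents in full, assuming the filesystem and the disk stack honour fsync.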
Yes, you could have unnoticed corruption, because "allocating writes" (writes that grow the file past its current bounds) can cause metadata operations, and you are not checking for metadata durability (only data durability).
As the state of the data is undefined in the case of interrupted overwrites it could be anything...
Between fsync calls, reordering could occur (e.g. if O_DIRECT silently fell back to buffering).
You're in even more trouble. To cover this you would need to be writing your own journal and probably using file renames.
No.
Yes. It is necessary (if not sufficient) to determine the above (with modern Linux and a truthful disk stack, assuming no bugs).
No.
(ETOOMANYQUESTIONS)
Yes, the Linux software stack could be buggy (2019: see the addendum below), or the hardware could be buggy (or lie in a way it can't back up), but that doesn't stop the above being the best you can do if everything lives up to its end of the bargain on a POSIX filesystem. If you know you have a particular OS with a particular filesystem (or no filesystem) and a particular hardware setup, then you may be able to reduce the need for some of the above, but in general you should not skip any step.
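On the point about allocating writes, the sketch below grows a file and flushes the result before any data is written into the new range. It assumes a full fsync (rather than fdatasync) as the conservative way to cover size and allocation metadata; whether a lighter call would suffice on a given filesystem is not something this example settles. Note that posix_fallocate itself extends the file size when the range ends past the current end of file. grow_file is a hypothetical helper name.

```python
import os


def grow_file(fd, new_size):
    """Extend the open file fd to new_size bytes and flush the new allocation.

    Returns the resulting file size. Never shrinks the file.
    """
    current = os.fstat(fd).st_size
    if new_size <= current:
        return current
    # Reserve blocks for the new range; raises OSError (e.g. ENOSPC) on failure.
    os.posix_fallocate(fd, current, new_size - current)
    # Conservative choice: fsync flushes metadata as well as data.
    os.fsync(fd)
    return os.fstat(fd).st_size
```

If the file was newly created, its parent directory would also need an fsync before the file's existence can be relied on.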
Bonus answer: O_DIRECT alone cannot guarantee durability when used with filesystems (an initial issue would be "how do you know metadata has been persisted?"). See "Clarifying Direct IO's Semantics" in the Ext4 wiki for discussion on this point.
Addendum (March 2019)
Even with the current (at the time of writing 5.0) Linux kernel, fsync doesn't always see error notifications, and kernels before 4.16 were even worse. The PostgreSQL folks found that notification of errors can be lost and unwritten pages marked as clean, leading to a case where fsync returns success even though there was a (swallowed) error asynchronously writing back the data (most Linux filesystems don't reliably keep dirty data around once a failure has happened, so repeatedly "retrying" a failed fsync doesn't necessarily indicate what you might expect). See the PostgreSQL "Fsync Errors" wiki page, the LWN article "PostgreSQL's fsync() surprise" and the talk "How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it" from FOSDEM 2019 for details.
So the post-credits conclusion is that it's complicated:
The fsync dance is necessary (even if it's not always sufficient) to at least cover the non-buggy I/O stack case.
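Given the addendum's point that retrying a failed fsync may report success without the data being on disk, one defensive option is to treat the first failure as final. The sketch below shows that idea only; DurabilityError and fsync_or_fail are hypothetical names, and what the caller does next (abort, replay from a journal) is left open.

```python
import os


class DurabilityError(RuntimeError):
    """Raised when fsync fails; the caller should not retry the sync."""


def fsync_or_fail(fd):
    """fsync fd once; convert any failure into a non-retryable error."""
    try:
        os.fsync(fd)
    except OSError as exc:
        # The kernel may already have marked the dirty pages clean, so a
        # second fsync could succeed without the data ever reaching disk.
        raise DurabilityError(f"fsync failed: {exc}") from exc
```

A caller would typically let DurabilityError propagate to the top level and recover from a known-good state rather than continuing to write.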
Also see:
Absolutely.
No. The answer is device dependent and likely filesystem dependent. Unfortunately, that filesystem could be layers and layers above the "actual" storage device (e.g. md, lvm, fuse, loop, ib_srp, etc).
That's true. But you can probably still use an NMI or sysrq-trigger to create a pretty abrupt halt.
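As a rough sketch of the sysrq route on Linux: writing a single command character to /proc/sysrq-trigger as root invokes that sysrq action, where 'c' crashes the kernel and 'b' reboots immediately without syncing or unmounting. The function name and the trigger_path parameter are assumptions for this example; the parameter exists only so the function can be pointed at an ordinary file when trying it out.

```python
def trigger_sysrq(command="c", trigger_path="/proc/sysrq-trigger"):
    """Write one sysrq command character to the trigger file.

    With the default path this halts the machine abruptly ('c' crashes the
    kernel, 'b' reboots without syncing). Requires root. Only run it on a
    disposable test instance.
    """
    if len(command) != 1:
        raise ValueError("sysrq command must be a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)
```

Running this right after a write or sync returns, then checking the file contents after reboot, gives a repeatable crash test. It does not simulate losing power to the disk's own write cache.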