I am looking for advice on how to get efficient and high performance asynchronous IO working for my application that runs on Ubuntu Linux 14.04.
My app processes transactions and creates a file on disk/flash. As the app progresses through transactions, additional blocks are created that must be appended to the file on disk/flash. The app also needs to frequently read blocks of this file as it processes new transactions. Each transaction might need to read a different block from this file in addition to creating a new block that has to be appended to this file. There is an incoming queue of transactions, and the app can continue to process transactions from the queue to build a deep enough pipeline of IO ops to hide the latency of read accesses or write completions on disk or flash. For a read of a block (which was put in the write queue by a previous transaction) that has not yet been written to disk/flash, the app will stall until the corresponding write completes.
I have an important performance objective – the app should incur the lowest possible latency to issue the IO operation. My app takes approximately 10 microseconds to process each transaction and be ready to issue a write to or a read from the file on disk/flash. The additional latency to issue an asynchronous read or write should be as small as possible so that the app can complete processing each transaction at a rate as close to 10 usecs per transaction as possible, when only a file write is needed.
We are experimenting with an implementation that uses io_submit to issue write and read requests. I would appreciate any suggestions or feedback on the best approach for our requirement. Is io_submit going to give us the best performance to meet our objective? What should I expect for the latency of each write io_submit and the latency of each read io_submit?
Using our experimental code (running on a 2.3 GHz Haswell Macbook Pro, Ubuntu Linux 14.04), we are measuring about 50 usecs for a write io_submit when extending the output file. This is too long and we aren't even close to our performance requirements. Any guidance to help me launch a write request with the least latency will be greatly appreciated.
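For reference, a minimal sketch of how such a submit-latency measurement might be taken with libaio; this is illustrative rather than our actual experimental code, and the iocb setup and error handling are simplified:

    /* Illustrative: time a single io_submit() of one appending write. */
    #include <libaio.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static long elapsed_ns(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
    }

    /* ctx must already have been initialised with io_setup(). */
    long time_one_submit(io_context_t ctx, int fd, void *buf, size_t len, off_t off)
    {
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        struct timespec t0, t1;

        io_prep_pwrite(&cb, fd, buf, len, off);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        int rc = io_submit(ctx, 1, cbs);      /* the call whose latency we care about */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (rc != 1)
            fprintf(stderr, "io_submit: %s\n", strerror(-rc));
        return elapsed_ns(t0, t1);            /* roughly 50000 ns in our measurements */
    }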
I speak as an author of proposed Boost.AFIO here.
Firstly, Linux KAIO (io_submit) is almost always blocking unless O_DIRECT is on and no extent allocation is required, and if O_DIRECT is on you need to be reading and writing in 4KB multiples on 4KB-aligned boundaries, else you force the device to do a read-modify-write. You therefore will gain nothing from Linux KAIO unless you rearchitect your application to be O_DIRECT and 4KB-aligned i/o friendly.
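To illustrate the constraint, here is a sketch of what O_DIRECT-and-aligned submission might look like with libaio; the 4096-byte block size is an assumption, and the real granularity depends on your device and filesystem:

    /* Sketch: open with O_DIRECT and submit a buffer whose address, length
     * and file offset are all multiples of the block size. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>

    #define BLOCK 4096   /* assumed device/filesystem block size */

    int submit_aligned_write(io_context_t ctx, const char *path, off_t off)
    {
        int fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0)
            return -1;

        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK) != 0)   /* 4KB-aligned buffer */
            return -1;

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, BLOCK, off & ~(off_t)(BLOCK - 1));
        return io_submit(ctx, 1, cbs);   /* stays async only if no allocation etc. is needed */
    }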
Secondly, never ever extend an output file during a write, you force an extent allocation and possibly a metadata flush. Instead fallocate the file's maximum extent to some suitably large value, and keep an internal atomic counter of the end of file. That should reduce the problem to just extent allocation which for ext4 is batched and lazy - more importantly you won't be forcing a metadata flush. That should mean KAIO on ext4 will be async most of the time, but unpredictably will synchronise as it flushes delayed allocations to the journal.
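A sketch of that approach; the maximum size and the names here are illustrative, not part of any API:

    /* Sketch: reserve extents once, then let transactions claim append
     * offsets from an atomic counter instead of growing the file. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdatomic.h>

    #define MAX_FILE_SIZE (64LL * 1024 * 1024 * 1024)   /* pick a generous upper bound */

    static atomic_long logical_eof;   /* bytes of the file actually in use */

    int prepare_file(int fd)
    {
        /* Reserve the extents up front so the hot write path never extends
         * the file or forces a metadata flush. */
        if (fallocate(fd, 0, 0, MAX_FILE_SIZE) != 0)
            return -1;
        atomic_store(&logical_eof, 0);
        return 0;
    }

    off_t claim_block(size_t len)
    {
        /* Each transaction claims its write offset without touching the
         * on-disk file size. */
        return atomic_fetch_add(&logical_eof, (long)len);
    }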
Thirdly, the way I'd probably approach your problem is to use atomic append (O_APPEND) without O_DIRECT nor O_SYNC, so what you do is append updates to an ever-growing file in the kernel's page cache, which is very fast and concurrency safe. You then, from time to time, garbage collect the data in the log file that is stale and whose extents can be deallocated using fallocate(FALLOC_FL_PUNCH_HOLE) so physical storage doesn't grow forever. This pushes the problem of coalescing writes to storage onto the kernel, where much effort has been spent on making this fast, and because it's an always-forward-progress write you will see writes hit physical storage in an order fairly close to the sequence in which they were written, which makes power loss recovery straightforward. This latter option is how databases do it, and indeed how journalling filing systems do it, and despite the likely substantial redesign of your software that it requires, this algorithm has been proven the best balance of latency to durability in the general purpose problem case.
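A rough sketch of what this might look like; where the stale regions come from is your own bookkeeping, and the offsets here are placeholders:

    /* Sketch: append-only log in the page cache, with stale regions later
     * released via hole punching so physical storage doesn't grow forever. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int open_log(const char *path)
    {
        /* No O_DIRECT, no O_SYNC: let the kernel coalesce and write back. */
        return open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    }

    ssize_t append_record(int fd, const void *rec, size_t len)
    {
        return write(fd, rec, len);   /* atomic append, concurrency safe */
    }

    int release_stale(int fd, off_t start, off_t len)
    {
        /* Deallocate extents of the log that are no longer referenced. */
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, start, len);
    }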
In case all the above seems like a lot of work, the OS already provides all three techniques rolled together into a highly tuned implementation better known as memory maps: 4KB-aligned i/o, O_DIRECT, never extending the file, all async i/o. On a 64-bit system, simply fallocate the file to a very large size and mmap it into memory. Read and write as you see fit. If your i/o patterns confuse the kernel's paging algorithms, which can happen, you may need a touch of madvise() here and there to encourage better behaviour. Less is more with madvise(), trust me.
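For example, roughly (the size and the commented madvise() hint are illustrative assumptions, not tuning advice):

    /* Sketch: fallocate to a large size once, mmap it, then read and write
     * through the mapping; the kernel does the aligned, async i/o for you. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>

    #define MAP_SIZE (64LL * 1024 * 1024 * 1024)   /* illustrative upper bound */

    void *map_file(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return NULL;
        if (fallocate(fd, 0, 0, MAP_SIZE) != 0)
            return NULL;

        void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return NULL;

        /* Only if the default readahead misbehaves for your access pattern: */
        /* madvise(p, MAP_SIZE, MADV_RANDOM); */
        return p;
    }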
An awful lot of people try to duplicate mmaps using various O_DIRECT algorithms without realising mmaps can already do everything you need. I'd suggest exploring those first; if Linux won't behave, try FreeBSD, which has a much more predictable file i/o model, and only then delve into the realm of rolling your own i/o solution. Speaking as someone who does these all day long, I'd strongly recommend you avoid them whenever possible; filing systems are pits of devils of quirky and weird behaviour. Leave the never-ending debugging to someone else.
Linux AIO is something of a black art where experienced practitioners know the gotchas, but for some reason it's taboo to talk to someone about the gotchas they don't already know. From scratching around on the web and experience I've come up with a few examples where Linux's asynchronous I/O submission may become (silently) synchronous (thereby turning io_submit() into a blocking call):
- You're submitting buffered (aka non-direct) I/O. You're at the mercy of Linux's caching, and your submit can go synchronous when what you're requesting isn't already in the read cache, or when the write cache is full and the new request can't be accepted until writeback completes.
- You asked for direct I/O to a file in a filesystem but for whatever reason the filesystem decides to ignore the O_DIRECT "hint" (e.g. how you submitted the I/O didn't meet the O_DIRECT alignment constraints, or the filesystem or that particular filesystem's configuration doesn't support O_DIRECT) and silently performs buffered I/O instead, resulting in the case above.
- You're doing direct I/O to a file in a filesystem but the filesystem has to do a synchronous operation (such as updating metadata) in order to fulfill the I/O. Some filesystems, such as XFS, try harder than others to provide good AIO behaviour, but even there a user still has to be very careful to avoid operations that will trigger synchronous behaviour.
- You're submitting too much outstanding I/O. Your disk/disk controller will have a maximum number of I/O requests that can be processed at the same time. There are maximum AIO request queue sizes for each specific device (see the /sys/block/[disk]/queue/nr_requests documentation and the un(der)documented /sys/block/[disk]/device/queue_depth) AND a system-wide maximum number of AIO requests (see the /proc/sys/fs/aio-max-nr documentation) within the kernel. Letting I/O back up and exceed the size of the kernel queues leads to blocking. (A small sketch after this list shows how to read these limits.)
- A sub-point is: if you submit I/Os that are "too large" (i.e. bigger than /sys/block/[disk]/queue/max_sectors_kb) they will be split up within the kernel and go on to chew up more than one request...
- A layer in the Linux block device stack between you and the submission to the disk has to block. For example, things like Linux software RAID (md) can make I/O requests passing through it stall while it updates its RAID 1 metadata on individual disks.
- Your submission causes the kernel to wait because:
- It needs to take a particular lock that is in use.
- It needs to allocate some extra memory or page something in.
The list above is not exhaustive.
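As referenced in the queue-size point above, here is a small sketch of reading the per-device and global limits before deciding how much I/O to keep outstanding; "sda" is just an example device name:

    /* Sketch: print the queue limits that bound how much AIO can be
     * outstanding before submission starts to block. */
    #include <stdio.h>

    static long read_limit(const char *path)
    {
        long v = -1;
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        printf("nr_requests : %ld\n", read_limit("/sys/block/sda/queue/nr_requests"));
        printf("queue_depth : %ld\n", read_limit("/sys/block/sda/device/queue_depth"));
        printf("aio-max-nr  : %ld\n", read_limit("/proc/sys/fs/aio-max-nr"));
        return 0;
    }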
A glimmer of hope for the future is a proposed feature that allows a program to request better behaviour by setting a flag which causes AIO submission to fail with EAGAIN if it would otherwise go on to block. At the time of writing, said AIO EAGAIN patches were submitted against the 4.13 kernel. Even if those patches go into a future kernel, you would still a) need said kernel (or a later one) to use the feature and b) have to be aware of the cases it doesn't cover (although I notice there are completely separate patches being proposed to try to return EAGAIN when stacked block devices would trigger blocking).
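If and when that lands, usage would presumably look something like the sketch below. This assumes the flag is exposed as RWF_NOWAIT carried in the iocb's aio_rw_flags field (as in the posted patches) and that your kernel headers and libaio are new enough to declare both, so treat it as speculative:

    /* Speculative sketch: ask the kernel to fail with EAGAIN rather than
     * block, guarded so it only compiles where RWF_NOWAIT is available. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <libaio.h>

    #ifdef RWF_NOWAIT
    int submit_nonblocking(io_context_t ctx, int fd, void *buf, size_t len, off_t off)
    {
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };

        io_prep_pwrite(&cb, fd, buf, len, off);
        cb.aio_rw_flags = RWF_NOWAIT;     /* fail with EAGAIN instead of blocking */

        int rc = io_submit(ctx, 1, cbs);
        if (rc == -EAGAIN) {
            /* Submission would have blocked; retry later or fall back. */
        }
        return rc;
    }
    #endif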
References:
- The AIOUserGuide has a "Performance considerations" section warning about io_submit() blocking/slowness situations.
- A good list of Linux AIO pitfalls is given in the "Performance issues" section of the README for the ggaoed AoE target.
- The "sleeps and waits during io_submit" XFS mailing list thread hints at some AIO constraints with filesystems.
- The "[PATCH 1/1 linux-next] ext4: add compatibility flag check to the patch" LKML mailing list thread has a reply from Ext4 lead dev Ted Ts'o talking about how filesystems can fallback to buffered I/O for
O_DIRECT
rather than failing the open()
call.
- In an LKML thread, BTRFS lead dev Chris Mason states that BTRFS resorts to buffered I/O when O_DIRECT is requested on compressed files.
- ZFS on Linux was changed from erroring on O_DIRECT to "supporting" it via buffered I/O (see point 3). There's further discussion from the lead-up to the commit in the ZFS on Linux "Direct IO" GitHub issue.
- The Ext4 wiki has a warning that certain Linux implementations (Which?) fall back to buffered I/O when doing O_DIRECT allocating writes.
- The 2004 document by the Linux Scalability Effort titled "Kernel Asynchronous I/O (AIO) Support" has a list of things that worked and things that did not work with Linux AIO (a bit old but a quick reference).
Related:
- Linux AIO: Poor Scaling
- io_submit() blocks until a previous operation will be completed
- buffered asynchronous file I/O on linux (but stick to the bits explicitly talking about Linux kernel AIO)
Hopefully this post helps someone (and if it does help you, could you upvote it? Thanks!).