Consider an application that is CPU bound, but also has high-performance I/O requirements.
I'm comparing Linux file I/O to Windows, and I can't see how epoll will help a Linux program at all. The kernel will tell me that the file descriptor is "ready for reading," but I still have to call blocking read() to get my data, and if I want to read megabytes, it's pretty clear that that will block.
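A small sketch of why epoll does not help with disk files. As far as I know, epoll_ctl() refuses a regular file outright with EPERM, while a pipe or socket is accepted. The helper name epoll_accepts_fd is made up for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Try to register `fd` for read-readiness on a fresh epoll instance.
 * Returns 0 if epoll accepted it, otherwise the errno reported by
 * epoll_create1() or epoll_ctl(). (Hypothetical helper name.) */
int epoll_accepts_fd(int fd)
{
    int ep = epoll_create1(0);
    if (ep < 0)
        return errno;

    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    int err = 0;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0)
        err = errno;            /* save before close() can overwrite it */

    close(ep);
    return err;
}
```

Calling this on the read end of a pipe should return 0, while calling it on a descriptor for an ordinary file on disk should return EPERM, so readiness notification is simply not offered for the case the question cares about.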
On Windows, I can open a file handle with FILE_FLAG_OVERLAPPED set, issue non-blocking I/O on it, get notified when the I/O completes, and use the data from the completion function. I spend no application-level wall-clock time waiting for data, which means I can precisely tune my number of threads to my number of cores and get 100% efficient CPU utilization.
If I have to emulate asynchronous I/O on Linux, then I have to allocate some number of threads to do this, and those threads will spend a little bit of time doing CPU things, and a lot of time blocking for I/O, plus there will be overhead in the messaging to/from those threads. Thus, I will either over-subscribe or under-utilize my CPU cores.
I looked at mmap() + madvise() (WILLNEED) as a "poor man's async I/O" but it still doesn't get all the way there, because I can't get a notification when it's done -- I have to "guess" and if I guess "wrong" I will end up blocking on memory access, waiting for data to come from disk.
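For reference, a minimal sketch of the mmap() + madvise(MADV_WILLNEED) approach described above. The function names map_with_readahead and sum_bytes are made up for this illustration. The point is that the hint is purely advisory, so the loop that touches the data is where the program may still block.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map `path` read-only and hint that the whole range will be needed soon.
 * On success returns the mapping and stores its length in *len_out;
 * returns NULL on failure (including an empty file). */
const unsigned char *map_with_readahead(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    /* Advisory only: the kernel may start readahead, but nothing tells us
     * when (or whether) the pages have arrived. */
    (void)madvise(p, (size_t)st.st_size, MADV_WILLNEED);

    *len_out = (size_t)st.st_size;
    return p;
}

/* Touch every byte; any page not yet resident blocks here on a page fault. */
unsigned long sum_bytes(const unsigned char *p, size_t len)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}
```

The caller is expected to munmap() the region when done. There is no completion event anywhere in this code, which is exactly the "guess" problem.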
Linux seems to have the starts of async I/O in io_submit, and it seems to also have a user-space POSIX aio implementation, but it's been that way for a while, and I know of nobody who would vouch for these systems for critical, high-performance applications.
The Windows model works roughly like this:
1. Issue an asynchronous operation.
2. Tie the asynchronous operation to a particular I/O completion port.
3. Wait on operations to complete on that port.
4. When the I/O is complete, the thread waiting on the port unblocks, and returns a reference to the pending I/O operation.
Steps 1 and 2 are typically done as a single thing. Steps 3 and 4 are typically done with a pool of worker threads, not (necessarily) the same thread that issues the I/O. This model is somewhat similar to the model provided by boost::asio, except boost::asio doesn't actually give you asynchronous block-based (disk) I/O.
The difference from epoll on Linux is that when epoll wakes you up in step 4, no I/O has happened yet. In effect it hoists step 1 to come after step 4, which is "backwards" if you already know exactly what you need.
Having programmed a large number of embedded, desktop, and server operating systems, I can say that this model of asynchronous I/O is very natural for certain kinds of programs. It is also very high-throughput and low-overhead. I think this is one of the remaining real shortcomings of the Linux I/O model, at the API level.
The real answer, which was indirectly pointed to by Peter Teoh, is based on io_setup() and io_submit().
Specifically, the "aio_" functions indicated by Peter are part of the glibc user-level emulation based on threads, which is not an efficient implementation.
The real answer is in:
io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)
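A minimal sketch of how these calls fit together, under some assumptions: the calls are made through syscall(2) because, as far as I know, glibc ships no wrappers for them (libaio does), and the k_ prefixed wrapper names, kaio_read_once, the queue depth of 8 and the tag value 42 are all made up for this illustration. It submits one read and then waits for its completion event.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrappers over the raw system calls (hypothetical names). */
static long k_io_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr, ctx);
}
static long k_io_submit(aio_context_t ctx, long n, struct iocb **cbs)
{
    return syscall(SYS_io_submit, ctx, n, cbs);
}
static long k_io_getevents(aio_context_t ctx, long min_nr, long nr,
                           struct io_event *events)
{
    /* NULL timeout: wait until at least min_nr events are available. */
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, NULL);
}
static long k_io_destroy(aio_context_t ctx)
{
    return syscall(SYS_io_destroy, ctx);
}

/* Read up to `len` bytes from the start of `path` with one kernel AIO
 * request. Returns the byte count, or a negative value on failure. */
ssize_t kaio_read_once(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    aio_context_t ctx = 0;          /* must be zero before io_setup */
    if (k_io_setup(8, &ctx) < 0) {
        close(fd);
        return -1;
    }

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_data = 42;               /* arbitrary tag, echoed back in the event */

    struct iocb *list[1] = { &cb };
    ssize_t result = -1;
    if (k_io_submit(ctx, 1, list) == 1) {
        /* A real program would do other work here and reap events elsewhere. */
        struct io_event ev;
        if (k_io_getevents(ctx, 1, 1, &ev) == 1 && ev.data == 42)
            result = (ssize_t)ev.res;   /* bytes read, or a negative errno */
    }

    k_io_destroy(ctx);
    close(fd);
    return result;
}
```

One caveat, as I understand it: without O_DIRECT the submit call itself may do the read synchronously, so this sketch shows the shape of the API rather than a guarantee of non-blocking behaviour.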
Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation:
http://man7.org/linux/man-pages/man7/aio.7.html
this implementation hasn't yet matured to the point where the POSIX
AIO implementation can be completely reimplemented using the kernel
system calls.
So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And, if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs read().
As explained in:
http://code.google.com/p/kernel/wiki/AIOUserGuide
and here:
http://www.ibm.com/developerworks/library/l-async/
Linux does provide async block I/O at the kernel level, with the following APIs:
aio_read     Request an asynchronous read operation
aio_error    Check the status of an asynchronous request
aio_return   Get the return status of a completed asynchronous request
aio_write    Request an asynchronous write operation
aio_suspend  Suspend the calling process until one or more asynchronous requests have completed (or failed)
aio_cancel   Cancel an asynchronous I/O request
lio_listio   Initiate a list of I/O operations
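A minimal sketch of how the functions listed above are typically combined from user space, using the POSIX <aio.h> interface. The function name posix_aio_read_once is made up for this illustration, and on older glibc versions I believe the program has to be linked with -lrt.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `len` bytes from the start of `path` through POSIX AIO.
 * Returns the byte count, or -1 on failure. */
ssize_t posix_aio_read_once(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    ssize_t result = -1;
    if (aio_read(&cb) == 0) {
        const struct aiocb *const pending[1] = { &cb };

        /* Sleep until the request finishes; loop in case a signal
         * interrupts aio_suspend() early. */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(pending, 1, NULL);

        int err = aio_error(&cb);
        ssize_t n = aio_return(&cb);    /* call once per completed request */
        if (err == 0)
            result = n;
    }

    close(fd);
    return result;
}
```

In a real program the wait would be replaced by a completion notification (the sigevent field of the control block), so that the submitting thread can keep working.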
And if you ask who the users of these APIs are, it is the kernel itself. Just a small subset is shown here:
./drivers/net/tun.c (for network tunnelling):
static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
./drivers/usb/gadget/inode.c:
ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
./net/socket.c (general socket programming):
static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
./mm/filemap.c (mmap of files):
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
./mm/shmem.c:
static ssize_t shmem_file_aio_read(struct kiocb *iocb,
etc.
At the userspace level, there is also the io_submit() etc. API (from glibc), but the following article offers an alternative to using glibc:
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
It implements functions like io_setup() as direct syscalls (bypassing the glibc dependency), so a kernel entry point matching the same "__NR_io_setup" number should exist. Searching the kernel source at:
http://lxr.free-electrons.com/source/include/linux/syscalls.h#L474 (the URL is for version 3.13, the latest) you are greeted with the declarations of these io_*() APIs in the kernel:
474 asmlinkage long sys_io_setup(unsigned nr_reqs, aio_context_t __user *ctx);
475 asmlinkage long sys_io_destroy(aio_context_t ctx);
476 asmlinkage long sys_io_getevents(aio_context_t ctx_id,
481 asmlinkage long sys_io_submit(aio_context_t, long,
483 asmlinkage long sys_io_cancel(aio_context_t ctx_id, struct iocb __user *iocb,
A later version of glibc should make this use of syscall() to call sys_io_setup() unnecessary, but without it you can always make these calls yourself, provided your kernel is recent enough to have sys_io_setup().
Of course, there are other userspace options for asynchronous I/O (e.g., using signals?):
http://personal.denison.edu/~bressoud/cs375-s13/supplements/linux_altIO.pdf
or perhaps:
What is the status of POSIX asynchronous I/O (AIO)?
"io_submit" and friends are still not available in glibc (see the io_submit man pages), which I have verified on Ubuntu 14.04. Note that this API is Linux-specific.
Others, such as libuv, libev, and libevent, are also asynchronous APIs:
http://nikhilm.github.io/uvbook/filesystem.html#reading-writing-files
http://software.schmorp.de/pkg/libev.html
http://libevent.org/
All of these APIs aim to be portable across BSD, Linux, Mac OS X, and even Windows.
In terms of performance I have not seen any numbers, but I suspect libuv may be the fastest, since it is so lightweight?
https://ghc.haskell.org/trac/ghc/ticket/8400
For network socket I/O, when it is "ready", it doesn't block. That's what O_NONBLOCK and "ready" mean.
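A short sketch of what that means in practice, with made-up helper names (set_nonblocking, try_read): once O_NONBLOCK is set, a read on a socket with nothing queued returns immediately with EAGAIN instead of sleeping, and a read after data has arrived returns that data.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put `fd` into non-blocking mode. Returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* One read attempt on a non-blocking descriptor.
 * Returns bytes read (0 means EOF), -1 if no data is ready yet
 * (EAGAIN/EWOULDBLOCK), or -2 on any other error. */
ssize_t try_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n >= 0)
        return n;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? -1 : -2;
}
```

With a socketpair(), try_read() on an empty socket gives -1 right away, and after the peer writes two bytes it gives 2. The same flag has no such effect on regular files, which is why disk I/O needs the separate mechanisms listed next.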
For disk I/O, we have POSIX AIO, Linux AIO, sendfile and friends.