unbuffered I/O in Linux

Published 2019-03-25 18:46

Question:

I'm writing lots and lots of data that will not be read again for weeks. As my program runs, the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, yet the amount of memory my app uses does not increase - and neither does the memory used by other processes.

This leads me to believe the memory is being consumed by the filesystem's cache. Since I do not intend to read this data for a long time, I'm hoping to bypass the system's buffers so that my data is written directly to disk. I don't have dreams of improving performance or being a super ninja; my hope is to give the filesystem a hint that I'm not going to be coming back for this data any time soon, so it shouldn't spend effort optimising for those cases.

On Windows I've faced similar problems and fixed them using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH - the machine's memory was not consumed by my app, and the machine was more usable in general. I'm hoping to duplicate those improvements on Linux. Windows imposes the restriction of writing in sector-sized pieces; I'm happy to accept that restriction for the amount of gain I've measured.

Is there a similar way to do this on Linux?

Answer 1:

The closest equivalent to the Windows flags you mention that I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:

   O_DIRECT (Since Linux 2.4.10)
          Try to minimize cache effects of the I/O to and from this file.  In
          general this will degrade performance, but it is useful in special
          situations, such as when applications do their own caching.  File I/O
          is done directly to/from user space buffers.  The O_DIRECT flag on its
          own makes an effort to transfer data synchronously, but does not
          give the guarantees of the O_SYNC flag that data and necessary
          metadata are transferred.  To guarantee synchronous I/O, O_SYNC
          must be used in addition to O_DIRECT.  See NOTES below for
          further discussion.

          A semantically similar (but deprecated) interface for block devices is
          described in raw(8).
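
For what it's worth, a minimal sketch of opening a file this way might look like the following. The file name and the 4096-byte block size are assumptions - query your device's real logical block size - and note that O_DIRECT requires the buffer, file offset and transfer length to all be suitably aligned:

    /* Minimal sketch: open with O_DIRECT | O_SYNC and write one aligned
       block.  The file name and 4096-byte block size are assumptions. */
    #define _GNU_SOURCE             /* needed for O_DIRECT on glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t block = 4096;  /* assumed device logical block size */

        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT needs an aligned buffer as well as aligned offset/length */
        void *buf;
        if (posix_memalign(&buf, block, block) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            close(fd);
            return 1;
        }
        memset(buf, 'x', block);

        if (write(fd, buf, block) != (ssize_t)block)
            perror("write");

        free(buf);
        close(fd);
        return 0;
    }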

Granted, while researching this flag to confirm it's what you want, I found an interesting piece arguing that unbuffered I/O is a bad idea, with Linus describing it as "brain damaged". According to that, you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
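
If you do go the madvise() route, a minimal sketch under the same caveats (the file name and region size are illustrative) would write through a shared mapping, sync it, and then hint that the pages can be dropped:

    /* Minimal sketch: write via a shared file mapping, flush it with
       msync(), then advise the kernel the pages are no longer needed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;     /* illustrative 1 MiB region */

        int fd = open("data.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 'x', len);            /* produce the data */
        msync(p, len, MS_SYNC);         /* write the pages to disk first */
        madvise(p, len, MADV_DONTNEED); /* hint: these pages can be dropped */

        munmap(p, len);
        close(fd);
        return 0;
    }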



Answer 2:

You can use O_DIRECT, but in that case you need to do the block I/O yourself: you must write in multiples of the filesystem block size and on block boundaries. (It is possible this is not strictly mandatory, but if you don't, performance will suffer badly - easily 1000x worse - because every unaligned write will need a read first.)

Another, much less invasive, way of stopping your blocks from using up the OS cache without using O_DIRECT is to use posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels that support it, this immediately discards (clean) pages from the cache. Of course you need to use fdatasync() or the like first, otherwise the pages may still be dirty and hence won't be cleared from the cache.

It is probably a bad idea to call fdatasync() and posix_fadvise(... POSIX_FADV_DONTNEED) after every write; better to wait until you've written a reasonable amount (50 MB or 100 MB, say).

So, in short:

  • after every significant chunk of writes,
  • call fdatasync() followed by posix_fadvise(... POSIX_FADV_DONTNEED);
  • this will flush the data to disc and immediately remove it from the OS cache, leaving space for more important things.
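
A minimal sketch of that pattern might look like this (the file name, total size and 64 MB threshold are illustrative assumptions):

    /* Minimal sketch: write sequentially, and roughly every 64 MB make the
       range clean with fdatasync(), then drop it from the page cache. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t CHUNK = 64 * 1024 * 1024;   /* flush/discard threshold */
        static char buf[1 << 16];               /* 64 KiB write buffer */
        memset(buf, 'x', sizeof buf);

        int fd = open("bulk.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        off_t done = 0, flushed = 0;            /* bytes written / flushed */
        for (int i = 0; i < 4096; i++) {        /* 256 MB in total */
            if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
                perror("write");
                break;
            }
            done += sizeof buf;
            if (done - flushed >= CHUNK) {
                fdatasync(fd);                  /* make the range clean */
                /* pages are now clean, so the kernel can discard them */
                posix_fadvise(fd, flushed, done - flushed, POSIX_FADV_DONTNEED);
                flushed = done;
            }
        }
        fdatasync(fd);                          /* final flush and discard */
        posix_fadvise(fd, flushed, done - flushed, POSIX_FADV_DONTNEED);
        close(fd);
        return 0;
    }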

Some users have found that fast-growing log files can easily blow "more useful" data out of the disc cache, which badly hurts the cache hit rate on a box that needs a lot of read cache but also writes logs quickly. This was the main motivation for this feature.

However, as with any optimisation:

a) you're probably not going to need it, so

b) don't do it (yet).



Answer 3:

as my program runs the amount of free memory on the machine drops very quickly

Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.

If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.

It's true that Linux by default waits around 30 seconds (that was the value at one time, anyhow) before flushing writes to disk. You can force the data out sooner with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cached copy in memory.

Seeing as you write to the file and don't read it back, Linux will probably guess that this data is the best candidate to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.