Performance issues with hard disk reading

Posted 2019-04-10 10:44

Question:

I have a C++ program which reads files from the hard disk and does some processing on the data in the files. I am using standard Win32 APIs to read the files. My problem is that this program is sometimes blazingly fast and then suddenly slows down to 1/6th of the previous speed. If I read the same files again and again over multiple runs, then normally the first run is the slowest one. It then maintains that speed until I read some other set of files. So my obvious guess was to profile the disk access time. I used the perfmon utility and measured IO Read Bytes/sec for my program. As expected, there was a huge difference (about 5 times) in the number of bytes read per second. My questions are:

(1). Does the OS (Windows in my case) cache recently read files somewhere so that subsequent loads are faster?

(2). If I can guarantee that all the files I read reside in the same directory, is there any way I can place them on the hard disk so that my disk access time is faster?

Is there anything I can do for this?

Answer 1:

1) Windows does cache recently read files in memory. The book Windows Internals includes an excellent description of how this works. Modern versions of Windows also use SuperFetch, which tries to preemptively fetch disk contents into memory based on usage history, and ReadyBoost, which can cache to a flash drive for faster random access. All of these increase the speed with which data is accessed from disk after the initial run.

2) The directory a file lives in really doesn't affect its layout on disk. Defragmenting your drive will group file data together. Windows Vista and later defragment your disk automatically. Ideally, you want to do large sequential reads and minimize your writes. Small random accesses and interleaving writes with reads significantly hurt performance. You can use the Windows Performance Toolkit to profile your disk access.



Answer 2:

Your numbered questions seem to be answered already. If you're still wondering what you can do to improve hard drive read speed, here are some tips:

  • Read with the OS functions (e.g., ReadFile) rather than wrapper libraries (like iostreams or stdio) if possible. Many wrappers introduce more levels of buffering.
  • Read sequentially, and let Windows know you're going to read sequentially with the FILE_FLAG_SEQUENTIAL_SCAN flag.
  • If you're only going to read (and not write), be sure to open the file just for reading.
  • Read in chunks, not bytes or characters.
  • Ideally the chunks should be multiples of the disk's cluster size.
  • Read from the disk at cluster-aligned offsets.
  • Read into memory at page boundaries. (If you're allocating a big chunk, it's probably page aligned.)
  • Advanced: If you can start your computation after reading just the beginning of the file, then you can use overlapped I/O to try to parallelize the computation and the subsequent reads as much as possible.
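
Several of the tips above (OS read calls, the sequential-scan hint, read-only access, cluster-multiple chunks) can be combined in one read loop. The following is only a minimal sketch, not a tuned implementation. The names `round_up`, `read_in_chunks`, and `kAssumedClusterSize` are made up for illustration, and the 4096-byte cluster size is an assumption; the real value depends on the volume. On Windows the sketch uses `CreateFileA` and `ReadFile`; elsewhere it falls back to unbuffered `fread` so the same loop can be tried on other systems. It assumes the chunk size fits in a `DWORD`, and it does not attempt page-aligned buffers or overlapped I/O.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Assumed cluster size (hypothetical value); the real one depends on the volume.
constexpr std::size_t kAssumedClusterSize = 4096;

// Round n up to the next multiple of unit (unit must be non-zero).
inline std::size_t round_up(std::size_t n, std::size_t unit) {
    assert(unit != 0);
    return ((n + unit - 1) / unit) * unit;
}

// Reads the file sequentially in cluster-multiple chunks and calls
// process(data, length) once per chunk. Returns total bytes read, or -1 on error.
template <typename Fn>
std::int64_t read_in_chunks(const char* path, std::size_t wanted_chunk, Fn process) {
    const std::size_t chunk = round_up(wanted_chunk, kAssumedClusterSize);
    std::vector<unsigned char> buffer(chunk);
    std::int64_t total = 0;
#ifdef _WIN32
    // Read-only access, with a hint that the file will be scanned front to back.
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (h == INVALID_HANDLE_VALUE) return -1;
    for (;;) {
        DWORD got = 0;
        if (!ReadFile(h, buffer.data(), static_cast<DWORD>(chunk), &got, nullptr)) {
            CloseHandle(h);
            return -1;
        }
        if (got == 0) break;  // end of file
        process(buffer.data(), static_cast<std::size_t>(got));
        total += got;
    }
    CloseHandle(h);
#else
    // Portable fallback: stdio with its own buffering switched off.
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return -1;
    std::setvbuf(f, nullptr, _IONBF, 0);
    std::size_t got = 0;
    while ((got = std::fread(buffer.data(), 1, chunk, f)) > 0) {
        process(buffer.data(), got);
        total += static_cast<std::int64_t>(got);
    }
    const bool failed = std::ferror(f) != 0;
    std::fclose(f);
    if (failed) return -1;
#endif
    return total;
}
```

A call such as `read_in_chunks("data.bin", 1 << 20, callback)` would hand the callback roughly 1 MiB at a time. Because each read continues where the previous one ended and the chunk is a multiple of the assumed cluster size, the file offsets stay cluster aligned as well.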


Answer 3:

Yes, Windows (like most modern OSes) keeps recently read file data in otherwise unused RAM, so that if that file data is requested again in the near future it is already available in RAM and disk access can be avoided.

As far as making disk access faster, you could try defragmenting your drive, but I wouldn't expect it to help too much. Drive access is just slow compared to RAM access, which is why RAM caching provides such a nice speedup.
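
One way to see the cache effect directly is to time two back-to-back reads of the same file. The sketch below is only an illustration; `ReadTiming` and `time_full_read` are hypothetical names, and it uses portable `fread` rather than the Win32 calls from the question to keep it short. The first call is a "cold" read only if the file is not already cached, for example right after a reboot.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

struct ReadTiming {
    std::size_t bytes;  // bytes read; stays 0 if the file could not be opened
    double seconds;     // wall-clock time for the whole read
};

// Reads the entire file in 1 MiB chunks and reports how long it took.
inline ReadTiming time_full_read(const char* path) {
    using clock = std::chrono::steady_clock;
    std::vector<char> buffer(1 << 20);
    ReadTiming result{0, 0.0};
    const auto start = clock::now();
    if (std::FILE* f = std::fopen(path, "rb")) {
        std::size_t got = 0;
        while ((got = std::fread(buffer.data(), 1, buffer.size(), f)) > 0) {
            result.bytes += got;
        }
        std::fclose(f);
    }
    result.seconds = std::chrono::duration<double>(clock::now() - start).count();
    return result;
}
```

Calling `time_full_read` twice on a large file that has not been touched recently, and comparing the two `seconds` values, should show whether the second pass is being served from RAM.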



Answer 4:

As a diagnostic test, can you accurately measure the time it takes to load the very first time?

Then use that time to determine the transfer rate, and compare that rate to what you get when running HD Tune. For what it's worth, I ran this myself and got 44.2 MB/s minimum, 87 MB/s average, and 110 MB/s maximum read speeds with my Western Digital RE3 drive (one of the faster 7200 RPM SATA drives available).

The point of all this is to see if your own application is doing the best it can. In other words, aside from caching you can't really read the files any faster than what your hard drive is capable of. So if you're reaching that limit then there's nothing more to do.
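
The comparison described here is simple arithmetic, sketched below. The function names are hypothetical, and the code assumes 1 MB means 10^6 bytes; if your benchmark tool reports in units of 2^20 bytes, adjust the divisor accordingly.

```cpp
#include <cstdint>

// Transfer rate in MB/s (assuming 1 MB = 1,000,000 bytes).
// Returns 0 when the elapsed time is not positive.
inline double transfer_rate_mb_per_s(std::uint64_t bytes, double seconds) {
    if (seconds <= 0.0) return 0.0;
    return (static_cast<double>(bytes) / 1.0e6) / seconds;
}

// Fraction of the drive's benchmarked rate that the application achieved.
// Returns 0 when the benchmark rate is not positive.
inline double fraction_of_drive_rate(double measured_mb_per_s, double drive_mb_per_s) {
    if (drive_mb_per_s <= 0.0) return 0.0;
    return measured_mb_per_s / drive_mb_per_s;
}
```

For instance, reading 435,000,000 bytes in 10 seconds works out to 43.5 MB/s, which is half of the 87 MB/s average quoted above. A cold read that lands close to the benchmarked rate leaves little room for improvement in the application itself.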



Answer 5:

Also, make sure that you are not running out of memory during your tests. Run perfmon and monitor Memory > Available Bytes and PhysicalDisk > Disk Read Bytes/sec for the physical drive you are reading. Monitoring the process's I/O is a good idea too, but keep in mind that the process counter combines all I/O (network included).

You should expect about 50 MB/s for sequential reads from a single average SATA drive. A couple of good striped serial SCSI drives will give you about 220 MB/s. If you see available memory dropping to near zero, that is your problem. If it stays flat after the first round of reading, then the slowdown has something to do with your app.



Answer 6:

Contig, a Microsoft Sysinternals utility, can be used to defragment a single file on disk or to create a new unfragmented file.



Answer 7:

For the crazy answer, you could try partitioning and formatting the drive so that your data sits on the fastest portion of the disk, and see if that helps any.

Tom's Hardware had a review on how that might be done.