I once had the theory that, on modern operating systems, multithreaded
read access to an HDD should perform better than single-threaded access.
My reasoning was:
the operating system queues all read requests
and rearranges them so that it can read from the HDD more
sequentially. The more requests it gets, the better it can rearrange them
to optimize the read sequence.
I was quite sure I had read this somewhere a few times.
But I did some benchmarking and found that multithreaded
read access mostly performs much worse, and never performs better.
I saw this under both Windows and Linux. I benchmarked plain
file searching using the operating system's tools, and also
wrote my own small benchmarks.
Am I missing something?
Can someone explain to me what is going on here?
Thank you!
Well apparently you're causing the read head to skip around all over the place. Your bottleneck is the disk, not the processor.
To rephrase: the CPU might be parallel, but the disk isn't.
solution: use NCQ to boost the performance. to do so configure your SATA HDD controller to use AHCI.
additional details below:
i had made similar observations when analyzing a particular application. on my quad-core system i compared the following configurations:
- 1 core only: pretty fast
- 4 cores enabled: much slower! this was quite surprising and also confusing to me.
it turned out the application was doing heavy, concurrent HDD access. in case of multiple cores (and hence multiple threads) this would noticeably slow down total execution time.
i did some research and learned that a feature called NCQ (native command queuing) will do the optimization of HDD access you are referring to.
in the SCSI world this has been a common standard for quite a while, and in the SATA world it was adopted some time back.
to unlock this feature it's required to configure your HDD controller to operate in AHCI mode - this is a prerequisite to use NCQ!
as regular desktop systems nowadays use on-board HDD controllers, this configuration part needs to be done in BIOS setup. for SATA configuration you can usually choose between the following operational modes:
- compatible / legacy IDE
- AHCI
i went ahead and implemented my own custom benchmark to compare one and the same system running with the following configurations:
- 4 cores enabled, legacy IDE: pretty slow
- 4 cores enabled, AHCI / NCQ: much faster. particular benchmark sections performed 6 times faster!
--
conclusion:
to unleash the full power of systems with concurrent HDD access:
- switch over to AHCI (so you can utilize NCQ)
- don't use the generic AHCI drivers that come with the OS. instead, use the vendor-specific, optimized drivers. example: windows 7 comes with generic AHCI drivers that support most of the common HDD controllers. however, when using an intel chipset, make sure to install the intel "matrix storage manager" or intel "rapid storage technology" (e.g. intel RST 11.7). in my tests the optimized drivers boosted HDD performance further.
not doing so will make some applications run slower when using multiple threads instead of a single thread. that's the surprising part you need to consider.
--
note: there's a myth out there that says: NCQ is only relevant for "server" environments (with hundreds of processes running in parallel).
my benchmark results are pointing in a different direction: it's also relevant for "desktop" environments. whenever heavy, concurrent HDD access is happening.
additional notes:
- some older chipsets / SATA HDD controllers do not support AHCI mode. but that's not covered here.
- some "old" OSes need special actions when either installing in AHCI mode or migrating an already installed system from IDE mode to AHCI. but that's not covered here.
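on linux, one rough way to check whether NCQ is actually in use is the queue depth the kernel reports for the drive. the sketch below assumes the usual sysfs location /sys/block/<device>/device/queue_depth and treats a depth above 1 as "queueing active". the function names and the example device name sda are made up for illustration, and none of this applies to windows.

```python
from pathlib import Path
from typing import Optional


def ncq_queue_depth(device: str, sysfs_root: str = "/sys/block") -> Optional[int]:
    """Return the queue depth the kernel reports for a disk, or None if unavailable."""
    path = Path(sysfs_root) / device / "device" / "queue_depth"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a number: we simply cannot tell.
        return None


def ncq_looks_active(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Assumption: a depth above 1 means the drive accepts queued commands."""
    depth = ncq_queue_depth(device, sysfs_root)
    return depth is not None and depth > 1


if __name__ == "__main__":
    dev = "sda"  # example device name only; adjust for your system
    print(dev, ncq_queue_depth(dev), ncq_looks_active(dev))
```

if the depth comes back as 1 (or the file is missing) while the controller runs in legacy IDE mode, that would be consistent with the slow results above.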
Whether or not you are seeing speedup will almost assuredly depend on the scenario you are looking at and the hardware. More details on your benchmarking methodology would be useful here.
At a coarse level, the opportunity for a speedup arises when you're not utilizing the maximum throughput of the i/o controller and its caches, or when you can overlap i/o with CPU-intensive work that would otherwise sit blocked waiting for it.
Are you comparing doing reads of multiple small files spread out across the system, or just reading a few large files sequentially? You'll see different performance characteristics here.
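To make that comparison concrete, here is a minimal benchmark sketch, not a rigorous tool. It reads the same set of files with a varying number of threads and reports the elapsed time. The function names and the thread counts are my own choices for illustration. Note that the OS file cache will distort repeated runs over the same files, so each run should start cold (on Linux, for example, by writing 3 to /proc/sys/vm/drop_caches as root) or use a different set of files.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, Tuple


def read_file(path: Path, chunk_size: int = 1 << 20) -> int:
    """Read one file to the end and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def timed_read(paths: Iterable[Path], workers: int) -> Tuple[int, float]:
    """Read all files using `workers` threads; return (bytes read, seconds)."""
    paths = list(paths)
    start = time.perf_counter()
    if workers <= 1:
        # Baseline: one file after another on the calling thread.
        total = sum(read_file(p) for p in paths)
    else:
        # Concurrent case: up to `workers` files are being read at once.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    files = [p for p in Path(sys.argv[1]).rglob("*") if p.is_file()]
    for n in (1, 2, 4, 8):
        size, secs = timed_read(files, n)
        print(f"{n} thread(s): {size} bytes in {secs:.2f} s")
```

Running it once against a directory of many small files and once against a few large ones should show the two access patterns separately.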
Have you profiled with a good systems profiler like the (free) windows performance toolkit to see what is going on in your benchmarks? This is practically a must.
These kinds of benchmarks can be a lot of fun to write and profile. Don't let a few false starts get in the way of digging in and looking for speedups.
-Rick
I think your assumption about the OS optimizing concurrent disk access is simply false. I imagine it does this sort of re-ordering when you use scatter/gather I/O from a single thread, but there's no practical way for it to optimize concurrent requests in this way. Any such scheme would introduce unnecessary latency in single-threaded reads. (The OS would have to wait a bit just in case a concurrent request came in.) Anyway, the short answer is that your concurrent requests are causing the read heads to jump all over the place. The OS cannot optimize this away.
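For readers who have not met scatter/gather I/O: below is a minimal sketch of a vectored read from a single thread, using Python's os.readv (Unix only). It only shows what such a call looks like, filling several buffers from one file in one system call. It does not demonstrate any reordering by the OS. The helper name scatter_read is hypothetical.

```python
import os
from typing import List


def scatter_read(path: str, sizes: List[int]) -> List[bytes]:
    """Fill several buffers from one file with a single readv() call."""
    buffers = [bytearray(n) for n in sizes]
    fd = os.open(path, os.O_RDONLY)
    try:
        got = os.readv(fd, buffers)  # returns the total number of bytes read
    finally:
        os.close(fd)

    # The buffers are filled in order; trim them to what was actually read.
    out = []
    remaining = got
    for buf in buffers:
        take = min(len(buf), remaining)
        out.append(bytes(buf[:take]))
        remaining -= take
    return out
```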
I think you are talking about native command queuing, which may or may not be enabled on the system you are testing with. From the Wikipedia entry:
In fact, newer mainstream Linux kernels support AHCI natively. Windows XP requires the installation of a vendor-specific driver even if AHCI is present on the host bus adapter. Windows Vista natively supports both AHCI and NCQ. FreeBSD fully supports AHCI and NCQ since version 8.0.
Also, I haven't done any tests, but NCQ may not be that effective for a directory walk that has to access small files/inodes all over the disk. It could be that the disk controller is able to service each request fast enough that a queue is never built up to reorder, thus you don't see any benefit.
It's probably important here that you split the reading of the directory or file information away from the processing of that information. In other words, disk IO in one thread, processing and searching in another. Pass completed IO information to the processing thread with a bounded queue. By doing this you'll ensure that your IO thread is never waiting on the processing of results before getting busy on the read of the next block of data to process.
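A minimal sketch of that split, assuming the "processing" is a simple byte-string search. One thread does nothing but read files; the calling thread searches the contents; a bounded queue connects the two. The names io_worker, search_files and max_pending are invented for this example.

```python
import queue
import threading
from pathlib import Path
from typing import Iterable, List

_DONE = object()  # sentinel marking the end of the stream


def io_worker(paths: Iterable[Path], out: queue.Queue) -> None:
    """Read files one after another and hand the raw bytes to the consumer."""
    try:
        for path in paths:
            out.put((path, path.read_bytes()))  # blocks while the queue is full
    finally:
        out.put(_DONE)  # always signal the end, even if a read failed


def search_files(paths: Iterable[Path], needle: bytes,
                 max_pending: int = 8) -> List[Path]:
    """Return the files containing `needle`; disk reads run in their own thread."""
    pending: queue.Queue = queue.Queue(maxsize=max_pending)
    reader = threading.Thread(target=io_worker, args=(paths, pending), daemon=True)
    reader.start()

    hits = []
    while True:
        item = pending.get()
        if item is _DONE:
            break
        path, data = item
        if needle in data:  # the CPU-side work happens here, off the IO thread
            hits.append(path)
    reader.join()
    return hits
```

The bound caps memory use: the IO thread keeps reading ahead until max_pending results are waiting, and only then pauses until the processing side catches up. Because a single thread issues all reads, the disk sees one request stream rather than several competing ones.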