Read sequential file - Compressed file vs Uncompressed file

Published 2019-07-10 02:20

Question:

I am looking for the fastest way to read a sequential file from disk. I read in some posts that if I compress the file using, for example, LZ4, I can achieve better performance than reading the flat file, because I will minimize the I/O operations.

But when I try this approach, scanning an LZ4-compressed file gives me poorer performance than scanning the flat file. I haven't tried the lz4demo linked below, but looking at it, my code is very similar.

I have found these benchmarks: http://skipperkongen.dk/2012/02/28/uncompressed-versus-compressed-read/ http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c?r=75

Is it really possible to improve performance by reading a compressed sequential file instead of an uncompressed one? What am I doing wrong?
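For reference, the comparison in question can be sketched like this. This is a minimal illustration, not the asker's actual code; it uses Python's stdlib zlib at its fastest setting as a stand-in for LZ4 (which is not in the stdlib), and the payload and sizes are made up:

```python
import os
import tempfile
import time
import zlib

# Compressible payload standing in for the real sequential file.
data = b"2019-07-10,some,row,of,csv,data\n" * 200_000

# Write a flat copy and a compressed copy to disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    flat_path = f.name
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(zlib.compress(data, 1))  # level 1: fastest, closest to LZ4's profile
    comp_path = f.name

def scan_flat(path):
    with open(path, "rb") as fh:
        return len(fh.read())

def scan_compressed(path):
    with open(path, "rb") as fh:
        return len(zlib.decompress(fh.read()))

t0 = time.perf_counter(); n_flat = scan_flat(flat_path)
t1 = time.perf_counter(); n_comp = scan_compressed(comp_path)
t2 = time.perf_counter()

print(f"flat: {t1 - t0:.4f}s  compressed: {t2 - t1:.4f}s")
os.unlink(flat_path); os.unlink(comp_path)
```

Note that this single-threaded version pays for the read and the decompression sequentially, which is exactly the situation the first answer below addresses.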

Answer 1:

Yes, it is possible to improve disk read by using compression.

This effect is most likely to happen if you use a multi-threaded reader: while one thread reads compressed data from disk, the other decodes the previous compressed block in memory.

Considering the speed of LZ4, the decoding operation is likely to finish before the other thread completes reading the next block. This way, you achieve a bandwidth improvement proportional to the compression ratio of the tested file.
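The overlapped read/decode pipeline described here can be sketched as follows. Again zlib stands in for LZ4, each block is compressed independently (as in LZ4's block format), and the in-memory list of blocks stands in for sequential disk reads; block size and payload are arbitrary:

```python
import io
import queue
import threading
import zlib

BLOCK = 64 * 1024
payload = b"x" * (4 * 1024 * 1024)

# Pretend on-disk file: a sequence of independently compressed blocks.
blocks = [zlib.compress(payload[i:i + BLOCK], 1)
          for i in range(0, len(payload), BLOCK)]

q = queue.Queue(maxsize=4)  # bounded, so the reader can't run far ahead

def reader():
    for b in blocks:   # stands in for sequential reads from disk
        q.put(b)
    q.put(None)        # end-of-file sentinel

threading.Thread(target=reader, daemon=True).start()

# Main thread decodes block N while the reader fetches block N+1.
out = io.BytesIO()
while (b := q.get()) is not None:
    out.write(zlib.decompress(b))
```

With a real file and a real disk, the `q.put`/`q.get` handoff is what lets the decode of one block overlap the I/O wait for the next.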

Obviously, there are other effects to consider when benchmarking. For example, seek times on an HDD are several orders of magnitude larger than on an SSD, and under bad circumstances they can become the dominant part of the timing, reducing any bandwidth advantage to zero.



Answer 2:

It depends on the speed of the disk versus the speed of decompression and the space savings of compression. I'm sure you can put this into a formula.
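The formula hinted at here might look like this. This is a rough back-of-envelope model, assuming a fully pipelined reader (so the slower stage dominates); the bandwidth and ratio numbers in the example are invented for illustration:

```python
# S: uncompressed size (bytes), D: disk bandwidth (bytes/s),
# X: decompression bandwidth (bytes/s), r: compressed/uncompressed ratio.
#   t_flat       = S / D
#   t_compressed = max(r * S / D, S / X)   # pipelined: slower stage wins
def speedup(S, D, X, r):
    t_flat = S / D
    t_comp = max(r * S / D, S / X)
    return t_flat / t_comp

# Example: 100 MB/s disk, 2 GB/s LZ4-class decode, 50% compression ratio.
print(speedup(1e9, 100e6, 2e9, 0.5))  # → 2.0
```

As long as decompression is faster than the disk (S / X < r * S / D), the speedup is simply 1 / r, i.e. proportional to the compression ratio, which matches the first answer.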

Is it really possible to improve performance by reading a compressed sequential file instead of an uncompressed one? What am I doing wrong?

Yes, it is possible (example: a 1 KB zip file could contain 1 GB of data - it would most likely be faster to read and decompress the ZIP).

Benchmark different algorithms and their decompression speeds. There are compression benchmark websites for that. There are also special-purpose high-speed compression algorithms.

You could also try changing the data format itself. Maybe switch to protobuf, which might be faster and smaller than CSV.
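A quick way to see the format effect without pulling in protobuf (which is not in the stdlib) is to compare CSV text against a fixed-width binary encoding built with `struct`; the record layout here is made up for illustration:

```python
import struct

# Hypothetical records: (int id, float value, int category).
rows = [(i, i * 0.5, i % 7) for i in range(10_000)]

# Text encoding, as a CSV file would store it.
csv_bytes = "".join(f"{a},{b},{c}\n" for a, b, c in rows).encode()

# Fixed-width little-endian binary: int32 + float64 + int32 = 16 bytes/row.
bin_bytes = b"".join(struct.pack("<idi", a, b, c) for a, b, c in rows)

print(len(csv_bytes), len(bin_bytes))
```

Which one is smaller depends on the data (short numbers favor text, long floats favor binary), but the binary form is always fixed-width and skips text parsing entirely, which is where formats like protobuf tend to win on read speed.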