Compression formats with good support for random a

2019-01-05 09:48发布

This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:

I currently use gzip compression for some very large files which contain sorted data. When the files are not compressed, binary search is a handy and efficient way to support seeking to a location in the sorted data.

But when the files are compressed, things get tricky. I recently found out about zlib's Z_FULL_FLUSH option, which can be used during compression to insert "sync points" in the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, though files I already have would have to be recompressed to add this feature (and strangely gzip doesn't have an option for this, but I'm willing to write my own compression program if I must).

It seems from one source that even Z_FULL_FLUSH is not a perfect solution...not only is it not supported by all gzip archives, but the very idea of detecting sync points in archives may produce false positives (either by coincidence with the magic number for sync points, or due to the fact that Z_SYNC_FLUSH also produces sync points but they are not usable for random access).

Is there a better solution? I'd like to avoid having auxiliary files for indexing if possible, and explicit, default support for quasi-random access would be helpful (even if it's large-grained--like being able to start reading at each 10 MB interval). Is there another compression format with better support for random reads than gzip?

Edit: As I mentioned, I wish to do binary search in the compressed data. I don't need to seek to a specific (uncompressed) position--only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."

12条回答
beautiful°
2楼-- · 2019-01-05 10:43

bgzip can compress files in a gzip variant which is indexable (and can be decompressed by gzip). This is used in some bioinformatics applications, together with the tabix indexer.

See explanations here: http://blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.html, and here: http://www.htslib.org/doc/tabix.html.

I don't know to what extent it is adaptable to other applications.

查看更多
贪生不怕死
3楼-- · 2019-01-05 10:43

I'm not sure if this would be practical in your exact situation, but couldn't you just gzip each large file into smaller files, say 10 MB each? You would end up with a bunch of files: file0.gz, file1.gz, file2.gz, etc. Based on a given offset within the original large, you could search in the file named "file" + (offset / 10485760) + ".gz". The offset within the uncompressed archive would be offset % 10485760.

查看更多
够拽才男人
4楼-- · 2019-01-05 10:45

Take a look at dictzip. It is compatible with gzip and allows coarse random access.

An excerpt from its man page:

dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952) allows extra data to be stored in the header of a compressed file. Programs like gzip and zcat will ignore this extra data. However, [dictzcat --start] will make use of this data to perform pseudo-random access on the file.

I have the package dictzip in Ubuntu. Or its source code is in a dictd-*.tar.gz. Its license is GPL. You are free to study it.

Update:

I improved dictzip to have no file size limit. My implementation is under MIT license.

查看更多
Anthone
5楼-- · 2019-01-05 10:48

I dont know if its been mentioned yet, but the Kiwix project had done great work in this regard. Through their program Kiwix, they offer random access to ZIM file archives. Good compression, too. The project originated when there was a demand for offline copies of the Wikipedia (which has reached above 100 GB in uncompressed form, with all media included). They have successfully taken a 25 GB file (a single-file embodiment of the wikipedia without most of the media) and compressed it to a measly 8 GB zim file archive. And through the Kiwix program, you can call up any page of the Wikipedia, with all associated data, faster than you can surfing the net.

Even though Kiwix program is a technology based around the wikipedia database structure, it proves that you can have excellent compression ratios and random access simultaneously.

查看更多
做自己的国王
6楼-- · 2019-01-05 10:54

Two possible solutions:

  1. Let the OS deal with compression, create and mount a compressed file system (SquashFS, clicfs, cloop, cramfs, e2compr or whatever) containing all your text files and don't do anything about compression in your application program.

  2. Use clicfs directly on each text file (one clicfs per text file) instead of compressing a filesystem image. Think of "mkclicfs mytextfile mycompressedfile" being "gzip <mytextfile >mycompressedfile" and "clicfs mycompressedfile directory" as a way of getting random access to the data via the file "directory/mytextfile".

查看更多
可以哭但决不认输i
7楼-- · 2019-01-05 10:54

razip supports random access with better performance than gzip/bzip2 which have to be tweaked for this support - reducing compression at the expense of "ok" random access:

http://sourceforge.net/projects/razip/

查看更多
登录 后发表回答