-->

What is the best compression algorithm that allows

2019-01-22 07:30发布

问题:

What is the best compression algorithm that allows random reads/writes in a file?

I know that any adaptive compression algorithms would be out of the question.

And I know huffman encoding would be out of the question.

Does anyone have a better compression algorithm that would allow random reads/writes?

I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read is across a block boundary?

In the context of your answers please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, and sometimes I'll want to read the last 19 bytes, and sometimes I'll want to read 17 bytes in the middle. .

回答1:

I am stunned at the number of responses that imply that such a thing is impossible.

Have these people never heard of "compressed file systems", which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?

I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.

Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details. But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.

Also check out: "StackOverflow: Compression formats with good support for random access within archives?"



回答2:

The razip format supports random access reads with better performance than gzip/bzip2 which have to be tweaked for this support:

http://sourceforge.net/projects/razip/



回答3:

A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.

If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.



回答4:

I think Stephen Denne might be onto something here. Imagine:

  • zip-like compression of sequences to codes
  • a dictionary mapping code -> sequence
  • file will be like a filesystem
    • each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
    • "filesystem" keeps track of which "file" belongs to which bytes (start, end)
    • each "file" is compressed according to dictionary
    • reads work filewise, uncompressing and retrieving bytes according to "filesystem"
    • writes make "files" invalid, new "files" are appended to replace the invalidated ones
  • this system will need:
    • defragmentation mechanism of filesystem
    • compacting dictionary from time to time (removing unused codes)
  • done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually

One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regrouping them.

This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.



回答5:

I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.

e.g.
We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.

Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.

This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.



回答6:

Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.

However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.

As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.

However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.



回答7:

No compression scheme will allow fine-grained random access, for two related reasons:

  • you can't know exactly how far into the compressed file your desired piece of data is, therefore
  • there is no way to know where a symbol starts (at what bit position for Huffman, worse for arithmetic coding).

I can only suggest treating the file like a broadcast stream and inserting frequent synchronization / position markers, with obvious overhead (the sync marks not only take up space themselves, but complicate the encoding because it has to avoid "accidental" sync marks!). Alternatively, and to avoid seeking being something like a binary search (with the optimization that you can take a better guess where to start than the middle), you could include a "table of contents" at the start or end of the file.

As for random-access writing... I can't think of any neat solution :(