Reusing compression dictionary

Posted 2019-07-29 06:25

Is there a compression tool that will let you output its dictionary (or similar) separate from the compressed output such that the dictionary can be re-used on a subsequent compression? The idea would be to transfer the dictionary one time, or use a reference dictionary at a remote site, and make a compressed file even smaller to transfer.

I've looked at the docs for a bunch of common compression tools, and I can't really find one that supports this. But most common compression tools aren't straight dictionary compression.

Usage I imagined is:

compress_tool --dictionary compressed.dict -o compressed.data uncompressed
decompress_tool --dictionary compressed.dict -o uncompressed compressed.data

To expand on my use case, I have a 500MB binary file F that I want to copy over a slow network. Compressing the file on its own yields 200MB, which is still larger than I'd like. However, both my source and destination have a file F' which is very similar to F, but sufficiently different that binary diff tools don't work well. I was thinking that if I compress F' at both sites and then re-use information from that compression to compress F at the source, I could eliminate some information from the transfer that could be rebuilt at the destination using F'.

2 answers
小情绪 Triste *
#2 · 2019-07-29 07:02

I've created dicflate exactly for this purpose: https://github.com/hrobeers/dicflate

dicflate -d compressed.dict < uncompressed > compressed.data
dicflate -x -d compressed.dict < compressed.data > uncompressed
太酷不给撩
#3 · 2019-07-29 07:18

Preset dictionaries aren't really useful for files that size. They're great for small data (think compressing fields in a database, RPC queries/responses, snippets of XML or JSON, etc.), but for larger files like yours, the algorithm builds up its own dictionary very quickly.

That said, it just so happens that I was playing with preset dictionaries in Squash fairly recently, and I do have some code which does pretty much what you're talking about for the zlib plugin. I'm not going to push it to master (I have a different API in mind if I decide to support preset dictionaries), but I've just pushed it to the 'deflate-dictionary-file' branch if you want to take a look. To compress, do something like

squash -ko dictionary-file=foo.dict -c zlib:deflate uncompressed compressed.deflate

To decompress,

squash -dko dictionary-file=foo.dict -c zlib:deflate compressed.deflate decompressed

AFAIK there is nothing in zlib which supports building a dictionary--you have to do that yourself. The zlib documentation describes the "format":

The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.
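For anyone who wants to try this without a custom tool, here is a minimal sketch using the zdict parameter that Python's standard zlib module exposes on compressobj and decompressobj. The dictionary and payload are made-up sample data, and deflate/inflate are just illustrative helper names, not part of any library:

```python
import zlib

# Hypothetical sample data: substrings we expect to see in the payload.
# Per the zlib docs, the most commonly used strings go towards the end.
dictionary = b'"status":"error""status":"ok""user_id":"timestamp":'
payload = b'{"user_id":1234,"status":"ok","timestamp":1564381500}'


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data, optionally priming the compressor with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress a blob; the same dictionary used for compression must be supplied."""
    if zdict:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


plain = deflate(payload)
primed = deflate(payload, dictionary)

print(len(payload), len(plain), len(primed))
```

The dictionary itself is never written into the compressed stream (only a checksum identifying it is), so both sides have to hold an identical copy. That is the "transfer the dictionary once" model from the question.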

For testing I was using something like this (YMMV):

cat input | tr ' ' '\n' | sort | uniq -c | awk '{printf "%06d %s\n",$1,$2}' | sort | cut -b8- | tail -c32768
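The 32768 in that pipeline matches deflate's maximum window size, so anything beyond the last 32 KiB of a dictionary would not be usable anyway. If you'd rather build the dictionary in Python, the following is a rough equivalent of the pipeline, under the assumption that splitting on any whitespace is good enough for your data (the shell version splits on spaces and newlines only). build_dictionary is a name I made up for the sketch:

```python
from collections import Counter

WINDOW = 32768  # deflate's maximum window size in bytes


def build_dictionary(data: bytes, limit: int = WINDOW) -> bytes:
    """Rank whitespace-separated tokens by frequency, most common last."""
    counts = Counter(data.split())
    # Sort ascending by count (ties broken by token) so that the most
    # frequent tokens end up at the end of the dictionary.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    blob = b"\n".join(token for token, _ in ranked)
    # Keep only the tail, which holds the most frequent tokens.
    return blob[-limit:]


sample = b"foo bar foo baz foo bar"
print(build_dictionary(sample))
```

The result can be passed as the dictionary on both the compressing and decompressing side. Like the shell version, this is a crude heuristic for word-like data and is unlikely to do much for a large binary file.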