Reusing compression dictionary

Is there a compression tool that will let you output its dictionary (or similar) separate from the compressed output such that the dictionary can be re-used on a subsequent compression? The idea would be to transfer the dictionary one time, or use a reference dictionary at a remote site, and make a compressed file even smaller to transfer.

I've looked at the docs for a bunch of common compression tools, and I can't really find one that supports this. But most common compression tools aren't straight dictionary compression.

The usage I imagine is something like:

compress_tool --dictionary compressed.dict -o compressed.data uncompressed
decompress_tool --dictionary compressed.dict -o uncompressed compressed.data

To expand on my use case: I have a 500 MB binary file F that I want to copy over a slow network. Compressed on its own it comes to 200 MB, which is still larger than I'd like. However, both the source and the destination have a file F' that is very similar to F, yet different enough that binary diff tools don't work well. My thinking is that if I compress F' at both sites and then reuse information from that compression when compressing F at the source, I might be able to leave out of the transfer whatever can be rebuilt at the destination from F'.

Tags: linux compression

2 Answers

小情绪 Triste *

#2 · 2019-07-29 07:02

I've created dicflate exactly for this purpose: https://github.com/hrobeers/dicflate

dicflate -d compressed.dict < uncompressed > compressed.data
dicflate -x -d compressed.dict < compressed.data > uncompressed

太酷不给撩

#3 · 2019-07-29 07:18

Preset dictionaries aren't really useful for files that size. They're great for small data (think compressing fields in a database, RPC queries/responses, snippets of XML or JSON, etc.), but for larger files like yours, the algorithm builds up its own dictionary very quickly.

That said, it just so happens that I was playing with preset dictionaries in Squash fairly recently, and I do have some code which does pretty much what you're talking about for the zlib plugin. I'm not going to push it to master (I have a different API in mind if I decide to support preset dictionaries), but I've just pushed it to the 'deflate-dictionary-file' branch if you want to take a look. To compress, do something like

squash -ko dictionary-file=foo.dict -c zlib:deflate uncompressed compressed.deflate

To decompress,

squash -dko dictionary-file=foo.dict -c zlib:deflate compressed.deflate decompressed

AFAIK there is nothing in zlib which supports building a dictionary--you have to do that yourself. The zlib documentation describes the "format":

The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.
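To make the quoted description concrete, here is a minimal sketch of a preset-dictionary round trip. It assumes Python's standard `zlib` module (which exposes the dictionary through the `zdict` argument) rather than the Squash branch mentioned above, and the helper names `compress_with_dict` and `decompress_with_dict` are made up for illustration. Both sides must hold exactly the same dictionary bytes.

```python
import zlib


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Compress data as a zlib stream primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress a stream that was produced with the same dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    # Hypothetical sample data: a short JSON-like record and a dictionary
    # made of strings that are likely to appear in such records.
    dictionary = b'"login", "logout", {"user": "", "action": "", "status": "ok"}'
    record = b'{"user": "alice", "action": "login", "status": "ok"}'

    plain = zlib.compress(record, 9)
    primed = compress_with_dict(record, dictionary)

    assert decompress_with_dict(primed, dictionary) == record
    print("without dictionary:", len(plain), "bytes")
    print("with dictionary:   ", len(primed), "bytes")
```

On a short record like this the primed stream should come out noticeably smaller. Deflate only looks back 32 KiB, so on a 500 MB input the dictionary can only help at the very start of the stream.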

For testing I was using something like this (YMMV):

cat input | tr ' ' '\n' | sort | uniq -c | awk '{printf "%06d %s\n",$1,$2}' | sort | cut -b8- | tail -c32768
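For anyone who would rather not use the shell pipeline, roughly the same heuristic can be sketched in Python. This is only an approximation under a few assumptions: the function name `build_dictionary` is made up, tokens are split on any whitespace (the pipeline splits on spaces and newlines only), and tokens are joined with newlines. As in the pipeline, tokens are ordered from least to most frequent so that the most common ones end up at the end of the dictionary, and only the last 32768 bytes are kept.

```python
from collections import Counter


def build_dictionary(sample: bytes, max_size: int = 32768) -> bytes:
    """Build a crude preset dictionary from whitespace-separated tokens."""
    counts = Counter(sample.split())
    # Ascending by count (ties broken by token) so frequent tokens come last.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    blob = b"\n".join(token for token, _ in ordered)
    # Keep only the tail, which holds the most frequent tokens.
    return blob[-max_size:]


if __name__ == "__main__":
    # Hypothetical sample text; in practice, read a representative input file.
    sample = b"GET /index GET /about POST /login GET /index"
    print(build_dictionary(sample))
```

The result can be written to a file and passed as the dictionary to whichever tool is in use. Whether token frequency is a good way to pick dictionary contents depends heavily on the data.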
