I have a (huge) set of similar data files, and the set is constantly growing. Each file is about 10K in size and must be compressed on its own. The compression is done with the zlib library, which is used by the java.util.zip.Deflater class. When I pass a dictionary to the deflate algorithm using setDictionary, I can improve the compression ratio.
Is there a way (an algorithm) to find the 'optimal' dictionary, i.e. one that yields the best overall compression ratio across the whole set?
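For concreteness, here is what the preset-dictionary mechanism looks like. This is a minimal sketch using Node's zlib bindings, which expose the same zlib feature that java.util.zip.Deflater wraps; the sample data and dictionary bytes are made up:

```javascript
// Sketch: deflate with and without a preset dictionary (Node.js).
// The sample and dictionary below are invented; in practice the
// dictionary would be built from substrings common across the files.
const zlib = require('zlib');

const sample = Buffer.from('GET /api/v1/items HTTP/1.1\r\nHost: example.com\r\n');
const dict = Buffer.from('HTTP/1.1\r\nHost: example.com\r\nGET /api/v1/');

const plain = zlib.deflateSync(sample);
const primed = zlib.deflateSync(sample, { dictionary: dict });
console.log(plain.length, primed.length); // primed is typically smaller

// The decompressor must be handed the same dictionary:
const restored = zlib.inflateSync(primed, { dictionary: dict });
console.log(restored.equals(sample)); // true
```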
See the zlib manual.
John Reiser explained on comp.compression:
The zlib manual recommends placing the most common occurrences toward the end of the dictionary.
This is because LZ77 uses a sliding window: strings near the end of the dictionary sit closest to the data being compressed, so they can be referenced with shorter distances and remain reachable further into your stream of data than the first few.
I'd play with generating the dictionary in a higher-level language with good string support. A crude JavaScript example:
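Something along these lines (the input string is just a stand-in; you'd feed it a sample of your real data):

```javascript
// Build a dictionary from whole-word frequencies: concatenate the words
// in ascending frequency order, so the most common ones land at the end.
const input = 'the quick brown fox jumps over the lazy dog and the fox runs';

const freq = {};
for (const word of input.split(/\s+/)) {
  freq[word] = (freq[word] || 0) + 1;
}

const dict = Object.keys(freq)
  .sort((a, b) => freq[a] - freq[b]) // ascending: most frequent last
  .join('');

console.log(dict);
```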
Then dict is a string containing each word exactly once, most frequent last (70 chars with the original sample text). You can copy-paste-run it in any JavaScript console; console.log(dict) prints the result.
That's just whole words, not substrings. There are also ways to overlap common substrings to save space in the dictionary: "therein", for instance, covers "the", "there", "here", "her", "rein", and "in" in only seven bytes.
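To sketch the overlap trick (a hypothetical helper; a real dictionary builder would also search for the merge order that saves the most bytes):

```javascript
// Merge two strings on their longest suffix/prefix overlap,
// e.g. merge('there', 'rein') yields 'therein'.
function merge(a, b) {
  for (let k = Math.min(a.length, b.length); k > 0; k--) {
    if (a.endsWith(b.slice(0, k))) return a + b.slice(k);
  }
  return a + b; // no overlap at all
}

console.log(merge('there', 'rein')); // 'therein'
```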