I am storing many chunks of base64 encoded 64-bit doubles in an XML file. The double data all looks similar.
The double data is currently compressed with the Java 'Deflater' class before encoding. However, each chunk of binary data in the file will have its own deflate data dictionary, which is an overhead I would like to greatly reduce. The 'Deflater' class has a 'setDictionary' method which I would like to use.
So my questions are:
1). Does anyone have any suggestions for how best to build a single custom data dictionary, based on multiple sections of doubles (x8 bytes), that could be used for multiple deflate operations, i.e. use the same dictionary for all the compressions? Should I be looking for common bytes across all byte arrays, with the commonest byte put at the end of the dictionary array?
2). Can I separate the (custom) data dictionary from the deflated data, and then set the dictionary against the deflated data later before inflating the data again?
3). Will the deflate algorithm take my custom data dictionary, and then just create its own slightly different data dictionary anyway, both invalidating my singular data dictionary and lessening the potential space saving from using a singular data dictionary?
4). Can someone elaborate on the structure of zlib compressed data, so that I myself may try to separate the data dictionary from the compressed data?
I want to use space for the data dictionary only once in my file, and use it for each block of my double data in my file but not store it with the double data. If the data dictionary cannot be separated from the deflated data and stored separately, then it seems that there would be little value in building a single custom dictionary, as each compressed block would have its own dictionary anyway. Is this right?
1). You can either set a fixed dictionary that consists of strings that are common and frequent in your data, or you can use the last n chunks concatenated as a dictionary. Either way, both the compression and decompression ends need the same dictionary to work with on any given chunk.
2). The dictionary is not sent with the data. That's the whole point. The other side needs to know the dictionary that was used in order to decompress, using some approach like those in #1.
3). The dictionary deflate uses has no structure. At any point in time, you are using the previous 32K of uncompressed data as the dictionary within which to search for matching strings starting at the next byte after that 32K. Setting the dictionary allows the compressor to get a head start for the first 32K of data. That's all there is to it.
4). The "dictionary" is in the compressed data simply as what you get when you decompress.
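To make the points above concrete, here is a minimal sketch of the fixed-dictionary approach using `java.util.zip`. The class name `SharedDictionaryDemo`, its helper methods, and the sine-based sample data are all hypothetical and only there for illustration; the dictionary here is just the raw bytes of one representative chunk, which is an assumption about what "common strings" look like in your data. It relies on the default zlib wrapper (not `nowrap`), where the inflater reports `needsDictionary()` before it can produce any output.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; all names here are illustrative.
final class SharedDictionaryDemo {

    // Pack doubles into big-endian bytes, 8 bytes per value.
    static byte[] toBytes(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES);
        for (double v : values) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    // Compress one chunk in zlib format. Pass null to skip the preset dictionary.
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dictionary != null) {
                deflater.setDictionary(dictionary); // must precede the first deflate() call
            }
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Decompress one chunk, supplying the same dictionary when the stream asks for it.
    static byte[] decompress(byte[] compressed, byte[] dictionary) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        if (dictionary == null) {
                            throw new IllegalArgumentException("stream requires a preset dictionary");
                        }
                        inflater.setDictionary(dictionary);
                    } else if (inflater.needsInput()) {
                        throw new IllegalArgumentException("compressed data is truncated");
                    }
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid compressed data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one representative chunk.
        double[] sample = new double[100];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = Math.sin(i * 0.37) * 1000.0;
        }
        byte[] dictionary = toBytes(sample); // stored once, outside the chunks

        // A second chunk that looks similar to the sample.
        double[] chunk = sample.clone();
        chunk[50] += 0.5;
        byte[] raw = toBytes(chunk);

        byte[] plain = compress(raw, null);
        byte[] withDict = compress(raw, dictionary);
        byte[] restored = decompress(withDict, dictionary);

        System.out.println("raw bytes:          " + raw.length);
        System.out.println("without dictionary: " + plain.length);
        System.out.println("with dictionary:    " + withDict.length);
        System.out.println("round trip ok:      " + Arrays.equals(raw, restored));
    }
}
```

In this sketch the dictionary bytes never appear in the compressed output; you would store them once in the file (for example as their own base64 element, which is an assumption about your layout) and hand them to every `Inflater`. The zlib stream does carry a checksum of the dictionary, so passing a different dictionary to `Inflater.setDictionary` fails with an `IllegalArgumentException` rather than silently producing wrong data.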