Are there any lossless compression methods that can be applied to floating point time-series data, and will significantly outperform, say, writing the data as binary into a file and running it through gzip?
Reduction of precision might be acceptable, but it must happen in a controlled way (i.e. I must be able to set a bound on how many digits must be kept).
I am working with some large data files which are series of correlated doubles, describing a function of time (i.e. the values are correlated). I don't generally need the full double precision but I might need more than float.
Since there are specialized lossless methods for images/audio, I was wondering if anything specialized exists for this situation.
Clarification: I am looking for existing practical tools rather than a paper describing how to implement something like this. Something comparable to gzip in speed would be excellent.
You might want to have a look at these resources:
You might also want to try Logluv-compressed TIFF for this, though I haven't used it myself.
One technique that the HDF5 people use is "shuffling", where you regroup the bytes of N floating point values so that all the first bytes come together, then all the second bytes, and so on. Neighbouring values tend to share sign, exponent and high mantissa bytes, so this produces repetitive sequences of bytes which compress better with gzip, for example.
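For illustration, here is a minimal numpy sketch of that shuffle idea (this is not HDF5's actual filter implementation, and the example data and names are my own):

```python
import gzip
import numpy as np

# Example data: a slowly varying float64 time series (illustrative only).
rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=100_000))

# Shuffle: view the values as raw bytes and regroup them so that all the
# 1st bytes are stored together, then all the 2nd bytes, and so on.
raw = data.view(np.uint8).reshape(-1, 8)        # one row of 8 bytes per value
shuffled = np.ascontiguousarray(raw.T).tobytes()

# The shuffled stream tends to contain long repetitive runs, so gzip
# usually does noticeably better on it than on the original layout.
print(len(gzip.compress(data.tobytes())), len(gzip.compress(shuffled)))

# Un-shuffle to recover the original values exactly (the transform is lossless).
restored = np.ascontiguousarray(
    np.frombuffer(shuffled, dtype=np.uint8).reshape(8, -1).T
).view(np.float64).ravel()
assert np.array_equal(restored, data)
```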
A second method I have found, which greatly reduces the size of the gzip-compressed data, is to first convert the data to the float16 (half-precision) format and back again to float32. This produces a lot of zero bytes in the output stream, which can shrink file sizes by around 40-60 per cent after compression. One subtlety is that the maximum float16 value is rather low (about 65504), so you may want to scale your data first, e.g. in python:
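A minimal sketch of that round trip, assuming a simple divide-by-maximum scaling (the example data, names and scale choice are mine):

```python
import gzip
import numpy as np

# Illustrative float32 time series; real data would come from your files.
t = np.linspace(0.0, 50.0, 100_000)
data = (300.0 + 10.0 * np.sin(t)).astype(np.float32)

# Scale so the values sit comfortably inside float16's range (max ~65504),
# then round-trip through float16 to zero out the low mantissa bits.
scale = float(np.abs(data).max())               # assumes non-zero data
stored = (data / scale).astype(np.float16).astype(np.float32)

# The float32 buffer now contains many zero bytes, which gzip exploits.
compressed = gzip.compress(stored.tobytes())

# To read back: decompress, reinterpret as float32 and undo the scaling
# (the scale factor must be kept alongside the compressed data).
restored = np.frombuffer(gzip.decompress(compressed), dtype=np.float32) * scale
print(np.max(np.abs(restored - data) / data))   # max fractional error, at most about 2**-11
```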
Some tests suggest that the mean absolute fractional difference between input and output for some data is around 0.00019, with a maximum of 0.00048. This is in line with the 2**-11 relative precision of the float16 mantissa.
Here are some ideas if you want to create your own simple algorithm:
You can use Holt's exponential smoothing algorithm (a prediction-based compression approach): assign some initial weight to the data and predict the next value. If the prediction is close to the actual value, XORing their bit patterns produces many zeros in the most significant bits, and those runs of zeros compress well.
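As an illustration of the predict-then-XOR idea, here is a rough sketch that uses plain exponential smoothing as the predictor instead of Holt's full level-plus-trend method; all names and the choice of alpha are mine:

```python
import numpy as np

def xor_residuals(values: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """Replace each float64 sample by (actual XOR predicted), bitwise.

    When the prediction is close to the actual value, the XOR result has many
    leading zero bits, which a byte-oriented compressor such as gzip handles
    well.  Expects a contiguous float64 array.
    """
    bits = values.view(np.uint64)
    out = np.empty_like(bits)
    out[0] = bits[0]                       # first sample is stored verbatim
    prediction = values[0]
    for i in range(1, len(values)):
        pred_bits = np.array([prediction], dtype=np.float64).view(np.uint64)[0]
        out[i] = bits[i] ^ pred_bits
        # Simple exponential smoothing update (stand-in for Holt's level+trend);
        # it uses only already-seen values, so a decoder can reproduce it.
        prediction = alpha * values[i] + (1.0 - alpha) * prediction
    return out
```

The transform is reversible: the decoder XORs each stored word with the same running prediction, which it can recompute from the values it has already restored; the residual stream is then fed to gzip or a similar back end.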
Since you state that you need a precision somewhere between 'float' and 'double': you can zero out any number of least significant bits in single- and double-precision floats. IEEE-754 floating point numbers are represented in binary roughly like seeefffffffff, which represents the value sign * 1.fffffff * 2^(eee).
You can zero out the least significant fraction (f) bits. For single-precision (32-bit) floats, there are 23 fraction bits, of which you can zero out up to 22. For double-precision (64-bit), it is 52 and up to 51. (If you zero out all the fraction bits, NaN values can no longer be distinguished from +/-inf, so keep at least one.)
Especially if the data represents decimal values such as 1.2345, this will help in data compression. That is because 1.2345 cannot be represented exactly as a binary floating point value, but rather as 0x3ff3c083126e978d, which is not friendly to data compression. Chopping off the least significant 24 bits results in 0x3ff3c08312000000, which is still accurate to about 9 decimal digits (in this example, the difference is 1.6e-9). If you do this on the raw data, and then store the differences between successive numbers, it will be even more compression-friendly (via gzip) if the raw data varies slowly.
Here is a rough sketch of the idea in python/numpy; the equivalent C code would apply the same bit mask to the 64-bit integer representation of each double:
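```python
import numpy as np

def zero_low_bits(data, bits_to_zero):
    """Clear the given number of least significant fraction bits of float64
    values (0 <= bits_to_zero <= 51).  Note this truncates rather than rounds."""
    mask = np.uint64(0xFFFFFFFFFFFFFFFF) << np.uint64(bits_to_zero)
    as_int = np.ascontiguousarray(data, dtype=np.float64).view(np.uint64)
    return (as_int & mask).view(np.float64)

values = np.array([1.2345])
truncated = zero_low_bits(values, 24)
print(hex(truncated.view(np.uint64)[0]))   # 0x3ff3c08312000000, as above
print(values[0] - truncated[0])            # about 1.6e-09
```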
Possible methods that can be used for floating-point compression:
Transpose 4xN for float and 8xN for double + lz77
Implementation: Floating point compression in TurboTranspose
see also error-bounded lossy compression
Predictor (ex. Finite Context Method) + encoding (ex. "integer compression").
Implementation: Floating point compression in TurboPFor
when possible, convert all floating point numbers to integers (ex. 1.63 -> 163), then use integer compression (see the sketch below)
Implementation: Integer compression
You can test all these methods with your data using the icapp tool for Linux and Windows.
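For the last item in the list above, here is a minimal sketch of the scale-to-integer and delta steps (the integer codec itself, e.g. one of those in TurboPFor, is not shown; the example values and the factor 100 are mine):

```python
import numpy as np

# Illustrative values known to carry two decimal digits of precision,
# e.g. sensor readings logged with fixed precision.
values = np.array([1.63, 1.64, 1.62, 1.65])

# Scale and round to integers (1.63 -> 163), as suggested above.
ints = np.rint(values * 100).astype(np.int64)

# Delta-encode so a slowly varying series turns into small integers,
# which an integer codec (or even gzip) compresses much better.
deltas = np.diff(ints, prepend=0)

# Reconstruction: cumulative sum, then undo the scaling.
restored = np.cumsum(deltas) / 100.0
assert np.allclose(restored, values)
```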