I want to write a storage backend to store larger chunks of data. The data can be anything, but it is mainly binary files (images, pdfs, jar files) or text files (xml, jsp, js, html, java...). I found most of the data is already compressed. If everything is compressed, about 15% disk space can be saved.
I am looking for the most efficient algorithm that can predict with high probability whether a chunk of data (let's say 128 KB) can be compressed (lossless compression), without having to look at all of the data if possible.
The compression algorithm will be either LZF, Deflate, or something similar (maybe Google Snappy). So predicting if data is compressible should be much faster than compressing the data itself, and use less memory.
Algorithms I already know about:
Try to compress a subset of the data, let's say 128 bytes (this is a bit slow)
Calculate the sum of 128 bytes, and if it's within a certain range then it's likely not compressible (within 10% of 128 * 127) (this is fast, and relatively good, but I'm looking for something more reliable, because the algorithm really only looks at the topmost bits for each byte)
Look at the file headers (relatively reliable, but feels like cheating)
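To make the second idea concrete, here is a minimal sketch of the byte-sum check in Java. The class and method names are made up for illustration, and the 10% tolerance around `n * 127` is simply the figure quoted above, not a tuned value.

```java
// Hypothetical helper illustrating the "sum of 128 bytes" heuristic.
final class ByteSumCheck {

    /**
     * Returns true if the sum of the first len bytes is within 10% of
     * n * 127, which is what uniformly random bytes would roughly give.
     * The 10% tolerance is an assumption taken from the description above.
     */
    static boolean looksIncompressible(byte[] data, int len) {
        int n = Math.min(len, data.length);
        if (n == 0) {
            return false;
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += data[i] & 0xff; // treat bytes as unsigned
        }
        long expected = n * 127L;
        long tolerance = expected / 10;
        return Math.abs(sum - expected) <= tolerance;
    }
}
```

Note the weakness: a block that alternates 0x00 and 0xFF is trivially compressible, yet its sum lands right in the "random" range, so this check reports it as incompressible.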
I guess the general idea is that I need an algorithm that can quickly calculate if the probability of each bit in a list of bytes is roughly 0.5.
Update
I have implemented 'ASCII checking', 'entropy calculation', and 'simplified compression', and all give good results. I want to refine the algorithms, and now my idea is to not only predict if data can be compressed, but also how much it can be compressed. Possibly using a combination of algorithms. Now if I could only accept multiple answers... I will accept the answer that gave the best results.
Additional answers (new ideas) are still welcome! If possible, with source code or links :-)
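For reference, the 'entropy calculation' mentioned in the update could look roughly like the sketch below. It computes the order-0 (per-byte) Shannon entropy of a sample. The class and method names are hypothetical, and this is not necessarily how it was actually implemented.

```java
// Hypothetical helper: order-0 Shannon entropy of a byte sample.
final class EntropyEstimate {

    /**
     * Returns the entropy of data[0..len) in bits per byte:
     * 0.0 means a single repeated byte value, 8.0 means all 256
     * values are equally frequent.
     */
    static double bitsPerByte(byte[] data, int len) {
        int n = Math.min(len, data.length);
        if (n == 0) {
            return 0.0;
        }
        int[] counts = new int[256];
        for (int i = 0; i < n; i++) {
            counts[data[i] & 0xff]++;
        }
        double entropy = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue; // unused byte values contribute nothing
            }
            double p = (double) c / n;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }
}
```

Dividing the result by 8 gives a rough estimate of how much an order-0 coder could shrink the data, which fits the "how much can it be compressed" idea. Two caveats: a sample of n bytes can never show more than log2(n) bits of entropy, so 128 bytes top out at 7 bits, and LZ-style compressors exploit repeated sequences that this measure does not see, so they may do better than the estimate suggests.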
Update 2
A similar method is now implemented in Linux.
It says on your profile that you're the author of the H2 Database Engine, a database written in Java.
If I am guessing correctly, you are looking to engineer this database engine to automatically compress BLOB data, if possible.
But -- (I am guessing) you have realized that not everything will compress, and speed is important -- so you don't want to waste a microsecond more than is necessary when determining if you should compress data...
My question is engineering in nature -- why do all this? Basically, isn't it second-guessing the intent of the database user / application developer -- at the expense of speed?
Wouldn't you think that an application developer (who is writing data to the blob fields in the first place) would be the best person to make the decision if data should be compressed or not, and if so -- to choose the appropriate compression method?
The only possible place I can see automatic database compression possibly adding some value is in text/varchar fields -- and only if they're beyond a certain length -- but even so, that option might be better decided by the application developer... I might even go so far as to allow the application developer a compression plug-in, if so... That way they can make their own decisions for their own data...
If my assumptions about what you are trying to do were wrong -- then I humbly apologize for saying what I said... (It's just one insignificant user's opinion.)
This problem is interesting in its own right because, with zlib for example, compressing incompressible data takes much longer than compressing compressible data. So an unsuccessful compression attempt is especially expensive (for details see the links). Nice recent work in this area has been done by Harnik et al. from IBM.
Yes, the prefix method and byte order-0 entropy (called entropy in the other posts) are good indicators. Other good ways to guess if a file is compressible or not are (from the paper):
Here is the FAST paper and the slides.
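As an illustration of the prefix method mentioned above, here is a minimal Java sketch that deflates only the first part of a chunk and reports the ratio. It uses the standard `java.util.zip.Deflater`. The class name, the prefix length and any threshold you apply to the result are my own assumptions, not values from the paper.

```java
import java.util.zip.Deflater;

// Hypothetical helper: compress only a prefix and look at the ratio.
final class PrefixProbe {

    /**
     * Deflates the first prefixLen bytes at the fastest level and
     * returns compressedSize / inputSize. Values near (or above) 1.0
     * suggest the chunk is probably not worth compressing.
     */
    static double prefixRatio(byte[] data, int prefixLen) {
        int n = Math.min(prefixLen, data.length);
        if (n == 0) {
            return 1.0;
        }
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(data, 0, n);
            deflater.finish();
            byte[] out = new byte[n + 64]; // scratch buffer, contents discarded
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(out, 0, out.length);
            }
            return (double) total / n;
        } finally {
            deflater.end(); // release native zlib memory
        }
    }
}
```

A caller might compress the full 128 KB chunk only if, say, `prefixRatio(chunk, 4096)` comes out below 0.9. The obvious limitation is that a prefix is not always representative, for example when a file starts with a compressible header followed by already-compressed content.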