I was thinking about compression, and it seems like there would have to be some sort of limit to the compression that could be applied to it, otherwise it'd be a single byte.
So my question is, how many times can I compress a file before:
- It does not get any smaller?
- The file becomes corrupt?
Are these two points the same or different?
Where does the point of diminishing returns appear?
How can these points be found?
I'm not talking about any specific algorithm or particular file, just in general.
In general, not even one. Whatever compression algorithm you use, there must always exists a file that does not get compressed at all, otherwise you could always compress repeatedly until you reach 1 byte, by your same argument.
If the program you use to compress the file does its job, the file will never corrupt (of course I am thinking to lossless compression).
Generally the limit is one compression. Some algorithms results in a higher compression ratio, and using a poor algorithm followed by a good algorithm will often result in improvements. But using the good algorithm in the first place is the proper thing to do.
There is a theoretical limit to how much a given set of data can be compressed. To learn more about this you will have to study information theory.
Here is the ultimate compression algorithm (in Python) which by repeated use will compress any string of digits down to size 0 (it's left as an exercise to the reader how to apply this to a string of bytes).
The program outputs 12 11 10 09 08 07 06 05 04 03 02 01 00 9 8 7 6 5 4 3 2 1 0 then empty string. It doesn't compress the string at each pass but it will with enough passes compress any digit string down to a zero length string. Make sure you write down how many times you send it through the compressor otherwise you won't be able to get it back.
It all depends on the algorithm. In other words the question can be how many times a file can be compressed using this algorithm first, then this one next...
In theory, we will never know, it is a never-ending thing:
(source)
Example of a more advanced compression technique using "a double table, or cross matrix" Also elimiates extrenous unnessacry symbols in algorithm
[PREVIOUS EXAMPLE] Take run-length encoding (probably the simplest useful compression) as an example.
04 04 04 04 43 43 43 43 51 52 11 bytes
That series of bytes could be compressed as:
[4] 04 [4] 43 [-2] 51 52 7 bytes (I'm putting meta data in brackets)
[TURNS INTO] 04.43.51.52 VALUES 4.4.**-2 COMPRESSION
Further Compression Using Additonal Symbols as substitute values
04.A.B.C VALUES 4.4.**-2 COMPRESSION