The most common way compressed files get corrupted is an inadvertent ASCII-mode FTP transfer, which performs a many-to-one trashing of CR and/or LF bytes inside the binary stream.
Obviously there is information loss, and the best fix is to transfer the file again in FTP binary mode.
However, if the original is lost and the data matters, how recoverable is it?
[Actually, I already know what I think is the best answer (it's very difficult but sometimes possible; I'll post more later) and the common non-answers (lots of off-the-shelf programs that repair CRCs without repairing data), but I thought it would be interesting to try this question out during the Stack Overflow beta period and see whether anyone else has gone down the successful-recovery path or discovered tools I don't know about.]
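To make the failure mode concrete, here is a minimal sketch (my own illustration, not from any tool) that simulates an ASCII-mode upload of a compressed stream in the LF-to-CRLF direction, the same direction described in the quote below:

```python
import os
import zlib

# Stand-in payload; any few-KB compressed stream will contain the byte
# 0x0A roughly once every 256 bytes, so the damage below is near-certain.
payload = os.urandom(1 << 16)
original = zlib.compress(payload)

# Simulate an ASCII-mode upload from a Unix host: the FTP client
# rewrites every bare LF in the *compressed* byte stream as CRLF.
mangled = original.replace(b"\n", b"\r\n")

try:
    zlib.decompress(mangled)
except zlib.error as exc:
    print("decompression fails:", exc)

# The naive inverse, mangled.replace(b"\r\n", b"\n"), also destroys any
# CRLF sequences that were genuinely part of the compressed data --
# that ambiguity is exactly what makes recovery hard.
```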
From Bukys Software:

Approximately 1 in 256 bytes is known to be corrupted, and the corruption is known to occur only in bytes with the value '\012'. So the byte error rate is 1/256 (0.39% of input), and 2/256 bytes (0.78% of input) are suspect. But since only three bits per smashed byte are affected, the bit error rate is only 3/(256*8): 0.15% is bad, 0.29% is suspect.
...
An error in the compressed input disrupts the decompression process for all subsequent bytes... The fact that the decompressed output is recognizably bad so quickly is cause for hope -- a search for the correct answer can identify wrong answers quickly.
Ultimately, several techniques were combined to successfully extract reasonable data from these files:

- Domain-specific parsing of fields and quoted strings
- Machine learning from previous data with low probability of damage
- Tolerance for file damage due to other causes (e.g. disk full while logging)
- Lookahead for guiding the search along the highest-probability paths

These techniques identify 75% of the necessary repairs with certainty, and the remainder are explored highest-probability-first, so that plausible reconstructions are identified immediately.
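The two observations above -- wrong guesses fail fast, and candidate repairs can be explored highest-probability-first -- combine naturally into a best-first search. Here is a toy sketch of that idea (my own code, not the Bukys tool; the prior `p_inserted` is a made-up parameter), assuming the LF-to-CRLF damage described in the quote:

```python
import heapq
import math
import zlib

def inflates(prefix):
    """True if zlib accepts `prefix` as the start of a valid stream.
    A wrong guess usually trips zlib within a few bytes, so bad
    branches are pruned long before a full decompression."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # auto-detect zlib/gzip header
    try:
        d.decompress(prefix)
        return True
    except zlib.error:
        return False

def best_first_repair(corrupt, p_inserted=0.9):
    """Treat every CRLF pair as suspect: either its CR was inserted by
    the transfer (probability p_inserted, an invented prior) or it was
    genuine.  A heap orders partial repairs by joint probability, so
    the most plausible complete reconstructions are yielded first."""
    heap = [(0.0, corrupt, 0)]  # (-log probability, candidate bytes, scan position)
    while heap:
        cost, data, pos = heapq.heappop(heap)
        crlf = data.find(b"\r\n", pos)
        if crlf == -1:
            yield data  # no suspects left: a complete candidate
            continue
        branches = (
            (data[:crlf] + data[crlf + 1:], p_inserted),  # drop the CR
            (data, 1.0 - p_inserted),                     # keep the CR
        )
        for fixed, p in branches:
            if inflates(fixed[:crlf + 2]):  # cheap early rejection
                heapq.heappush(heap, (cost - math.log(p), fixed, crlf + 1))
```

A real tool would accept a candidate only after a full decompression plus trailer-CRC check, and would fold in the domain-specific priors the quote mentions instead of a single flat probability.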
You could try writing a little script that replaces CRs with CRLFs (assuming the direction of trashing was CRLF to CR), toggling each candidate position block by block until the block's CRC comes out right. Assuming the data isn't particularly large, I guess that might not take until the heat death of the universe to complete.
As there is definite information loss, I don't know that there is a better way. Damage in the CR-to-CRLF direction might be slightly easier to roll back, since that expansion adds bytes rather than collapsing them.
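For the record, a literal-minded version of that script might look like the sketch below (assuming a gzip file and the CRLF-to-CR direction; all names are mine). zlib verifies the stored trailer CRC32 itself in gzip mode, so a successful decompress doubles as the "correct CRC" test:

```python
import itertools
import zlib

def brute_force_repair(corrupt):
    """Try every subset of CR bytes as 'was really CRLF' restorations
    and accept the first candidate whose gzip stream decompresses with
    a matching trailer CRC32.  Exponential in the number of CRs, so
    only viable for small files with few suspect positions."""
    spots = [i for i, b in enumerate(corrupt) if b == 0x0D]
    for k in range(len(spots) + 1):
        for chosen in itertools.combinations(spots, k):
            candidate = bytearray(corrupt)
            for i in reversed(chosen):          # insert right-to-left so
                candidate[i + 1:i + 1] = b"\n"  # earlier offsets stay valid
            try:
                # gzip mode makes zlib check the stored CRC32 and size;
                # a mismatch raises zlib.error ("incorrect data check").
                return zlib.decompress(bytes(candidate), wbits=zlib.MAX_WBITS | 16)
            except zlib.error:
                continue
    return None  # no combination produced a consistent file
```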