If I GZip this text:
Hello World
through C# using this code:
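    using System.IO;
    using System.IO.Compression;
    using System.Text;

    Stream stream = new MemoryStream(Encoding.Default.GetBytes("Hello World"));
    var compressedMemoryStream = new MemoryStream();
    using (var gzipStream = new GZipStream(compressedMemoryStream, CompressionMode.Compress))
    {
        // Disposing the GZipStream at the end of this block flushes the
        // remaining deflate data and writes the gzip trailer, so no explicit
        // Close() call is needed.
        stream.CopyTo(gzipStream);
    }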
the resulting stream is 133 bytes long.
Running the same string through either Fiddler's Utilities.GzipCompress or this PHP page gives a result that is only 31 bytes long.
In both cases the input is 11 bytes, so I would imagine the PHP result is correct, but this would mean that I can't decompress the PHP gzip output from within .NET or vice versa. Why is the .NET output so much larger?
Actually, it turns out that while the results from PHP and Fiddler are the same length, they are not identical. I can decompress the PHP version in .NET, but not the Fiddler version. The PHP page decompresses all three, so it looks like there may be an incompatibility between Fiddler's and .NET's implementations of gzip.
As requested, I've uploaded the three outputs to Dropbox here.
And these are the raw hex dumps of those files (not sure if they are really any use like this, but I think they show that the difference between the Fiddler and PHP versions is in the header, rather than in the compressed data itself):
Fiddler:
0000-0010: 1f 8b 08 00-c2 e6 ff 4f-00 ff f3 48-cd c9 c9 57 .......O ...H...W
0000-001f: 08 cf 2f ca-49 01 00 56-b1 17 4a 0b-00 00 00 ../.I..V ..J....
PHP:
0000-0010: 1f 8b 08 00-00 00 00 00-00 03 f3 48-cd c9 c9 57 ........ ...H...W
0000-001f: 08 cf 2f ca-49 01 00 56-b1 17 4a 0b-00 00 00 ../.I..V ..J....
C#:
0000-0010: 1f 8b 08 00-00 00 00 00-04 00 ec bd-07 60 1c 49 ........ .....`.I
0000-0020: 96 25 26 2f-6d ca 7b 7f-4a f5 4a d7-e0 74 a1 08 .%&/m.{. J.J..t..
0000-0030: 80 60 13 24-d8 90 40 10-ec c1 88 cd-e6 92 ec 1d .`.$..@. ........
0000-0040: 69 47 23 29-ab 2a 81 ca-65 56 65 5d-66 16 40 cc iG#).*.. eVe]f.@.
0000-0050: ed 9d bc f7-de 7b ef bd-f7 de 7b ef-bd f7 ba 3b .....{.. ..{....;
0000-0060: 9d 4e 27 f7-df ff 3f 5c-66 64 01 6c-f6 ce 4a da .N'...?\ fd.l..J.
0000-0070: c9 9e 21 80-aa c8 1f 3f-7e 7c 1f 3f-22 be 9d 97 ..!....? ~|.?"...
0000-0080: 65 95 7e b7-aa cb d9 ff-13 00 00 ff-ff 56 b1 17 e.~..... .....V..
0000-0085: 4a 0b 00 00-00
Modern compression tools will generally use more than one compression strategy. With WinZip, WinRAR, etc., you'll typically get options like:
If you were to do the same, you'd probably be able to compress the file further.
No matter what content you feed into GZipStream, you get the same overhead: GZipStream output looks identical for the first 108 bytes. Up to
1f 8b 08 00 00 00 00 00 04 00
it fits the standard definition ( http://www.faqs.org/rfcs/rfc1952.html ). The remainder of the fixed section has been explained by @mark-adler in Why does BCL GZipStream (with StreamReader) not reliably detect Data Errors with CRC32?
GZipStream adds a 10-byte header and an 8-byte footer to the compressed data, as described in the RFC 1952 specification. This gives a result that is 133 bytes long. The PHP page you linked to also adds the same 18-byte header/footer if asked to (GZIP-compatible encoding?). If you use that, it gives a result that is 31 bytes long. Without the header/footer, the difference between them is 115 versus 13 bytes (133 - 18 and 31 - 18).
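The header/footer arithmetic can be checked directly against the dumps in the question. The following is a minimal Python sketch, not anything from the original posts; Python is used only because its standard library wraps zlib. The helper name `split_gzip` is made up for illustration, and the code assumes a single gzip member with no optional header fields (FLG = 0), which holds for all three dumps shown. The input is the 31-byte PHP output copied from the hex dump.

```python
import struct
import zlib

# The 31 bytes of the PHP output, copied from the hex dump in the question.
PHP_GZIP = bytes.fromhex(
    "1f8b0800000000000003"        # 10-byte header
    "f348cdc9c95708cf2fca490100"  # deflate payload
    "56b1174a0b000000"            # 8-byte trailer: CRC-32, then ISIZE
)

def split_gzip(blob):
    """Split a simple gzip member (FLG == 0) into header fields, payload, trailer.

    Hypothetical helper for illustration; it does not handle optional fields.
    """
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<HBBIBB", blob[:10])
    if magic != 0x8B1F or method != 8 or flags != 0:
        raise ValueError("not a plain gzip member with deflate and FLG == 0")
    crc32, isize = struct.unpack("<II", blob[-8:])
    return {"mtime": mtime, "xfl": xfl, "os": os_id}, blob[10:-8], crc32, isize

header, payload, crc32, isize = split_gzip(PHP_GZIP)
text = zlib.decompress(payload, wbits=-15)  # negative wbits = raw deflate, no wrapper

print(len(PHP_GZIP), "bytes total;", len(payload), "bytes of deflate data")
print("trailer CRC-32 matches:", crc32 == zlib.crc32(text))
print("trailer ISIZE matches:", isize == len(text))
print("C# payload by the same arithmetic:", 133 - 18, "bytes")
```

The same split applied to the 133-byte C# dump leaves 115 bytes of deflate data for the same 11 bytes of input.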
Preface: .NET users should not use the Microsoft-provided GZipStream or DeflateStream classes under any circumstances, unless Microsoft replaces them completely with something that works. Use the DotNetZip library instead.
Update to Preface: The .NET Framework 4.5 and later have fixed the compression problem, and GZipStream and DeflateStream use zlib in those versions. I do not know if the CRC problem referenced below has been fixed.
Another update: The CRC problem is not only not fixed, but Microsoft has decided that they won't fix it!
This is one of several bugs in GZipStream. No self-respecting gzip compressor should ever produce 133 bytes of output from 11 bytes of input. See my comments at Why does BCL GZipStream (with StreamReader) not reliably detect Data Errors with CRC32?
What is happening internally is that GZipStream is not using the static or stored methods, both of which would produce compressed data about the same size as the input data (on top of which would be added 18 bytes of gzip header and trailer). Instead it is using the dynamic method, which creates a very large code descriptor header for a very small number of codes. It is simply a bug / very bad implementation.
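To get a feel for the sizes the stored and static methods produce on this input, here is a small Python sketch using the standard `zlib` module. It is an illustration under assumptions, not a reproduction of GZipStream: level 0 is used to make zlib emit stored blocks, and level 6 lets zlib choose the block type itself (for an input this small it is expected to choose a static block). The helper name `raw_deflate` is hypothetical.

```python
import zlib

DATA = b"Hello World"

def raw_deflate(data, level):
    """Compress to a raw deflate stream (negative wbits: no zlib/gzip wrapper)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()

stored = raw_deflate(DATA, 0)   # level 0: stored (uncompressed) blocks
default = raw_deflate(DATA, 6)  # zlib picks the cheapest block type itself

for name, stream in (("stored", stored), ("default", default)):
    # Bits 1-2 of the first byte hold the block type: 0 stored, 1 static, 2 dynamic.
    block_type = (stream[0] >> 1) & 0b11
    print(name, len(stream), "bytes, first block type", block_type,
          "-> gzip total", len(stream) + 18)
```

Both results stay within a few bytes of the 11-byte input, which is the behaviour described above; adding the 18 bytes of gzip header and trailer to the static result gives the 31 bytes seen from PHP and Fiddler.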
Update:
With the hex dumps, I can provide some analysis. First, both the Fiddler and PHP output are correct and proper. The only difference between them is in the gzip header, in particular the timestamp, which is set in Fiddler but not in PHP, and the originating operating system, which is set in PHP but not in Fiddler. For both, the 13 bytes of compressed data are identical, and can be represented as (using my infgen program to disassemble deflate streams):
which is exactly as it should be. A single static block, which requires no code descriptors, and simply coding all of the bytes as literals. (No matches of previous strings with lengths and distances.)
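The header-only difference between the Fiddler and PHP outputs can be confirmed with a byte-level comparison. This is a minimal Python sketch over the two hex dumps from the question; the field offsets come from the fixed header layout in RFC 1952 (MTIME at offsets 4 to 7, XFL at 8, OS at 9), and the variable names are illustrative.

```python
from datetime import datetime, timezone

# Both 31-byte outputs, copied from the hex dumps in the question.
FIDDLER = bytes.fromhex(
    "1f8b0800c2e6ff4f00ff" "f348cdc9c95708cf2fca490100" "56b1174a0b000000")
PHP = bytes.fromhex(
    "1f8b0800000000000003" "f348cdc9c95708cf2fca490100" "56b1174a0b000000")

# Offsets at which the two files differ.
diff = [i for i, (a, b) in enumerate(zip(FIDDLER, PHP)) if a != b]
print("differing offsets:", diff)

# MTIME is a little-endian Unix timestamp; 0 means "no timestamp available".
fiddler_mtime = int.from_bytes(FIDDLER[4:8], "little")
php_mtime = int.from_bytes(PHP[4:8], "little")
print("Fiddler MTIME:", datetime.fromtimestamp(fiddler_mtime, tz=timezone.utc))
print("PHP MTIME:", php_mtime)

# OS byte: 3 is Unix, 255 is "unknown" in RFC 1952.
print("OS bytes:", FIDDLER[9], PHP[9])
print("payload and trailer identical:", FIDDLER[10:] == PHP[10:])
```

Every differing offset falls inside the 10-byte header, so the deflate data and the trailer are the same in both files.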
The output of GZipStream on the other hand is a horrible mess in several ways. The compressed data is:
So what is all that? The actual data is just the line near the end "literal 'Hello World", which just codes each byte of the input. What precedes it is a description of a set of Huffman codes for literals, lengths, and distances. Here are the things wrong with it:
All of this points to the simple fact that whoever wrote this GZipStream code was, to put it as politely as I can, lacking in any understanding of the deflate format or compression in general. They elected to produce only dynamic blocks (except for an empty static block at the end), to only produce the same dynamic header every time (I think), defeating the purpose of dynamic blocks, and to not bother to figure out if the current block is the last one, requiring putting out an empty block to mark the end.
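The block types can be read straight off the hex dumps in the question: the first byte after the 10-byte gzip header carries the deflate block header in its three lowest bits. A minimal Python sketch, with the first payload byte of each dump copied by hand (the dictionary and function names are illustrative):

```python
# First payload byte (offset 10, right after the gzip header) of each dump.
FIRST_PAYLOAD_BYTE = {"Fiddler": 0xF3, "PHP": 0xF3, "C#": 0xEC}
BLOCK_TYPES = {0: "stored", 1: "static Huffman", 2: "dynamic Huffman", 3: "reserved"}

def block_header(first_byte):
    """Decode BFINAL (bit 0) and BTYPE (bits 1-2); deflate packs bits LSB first."""
    return bool(first_byte & 1), BLOCK_TYPES[(first_byte >> 1) & 0b11]

for name, byte in FIRST_PAYLOAD_BYTE.items():
    is_final, kind = block_header(byte)
    print(f"{name}: final block = {is_final}, type = {kind}")
```

The Fiddler and PHP streams open with a static block marked as final, while the C# stream opens with a dynamic block that is not marked as final, so at least one more block has to follow it to end the stream.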
As noted elsewhere, those aren't the only problems with GZipStream. It can't even properly use the CRC-32 as intended to detect corrupt streams.
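For comparison, this is roughly what CRC-32 checking looks like in a decoder that does enforce it. The sketch uses Python's standard `gzip` module and says nothing about how GZipStream behaves in any particular .NET version; it only illustrates the intended use of the trailer CRC described in RFC 1952.

```python
import gzip

good = gzip.compress(b"Hello World")

# Flip one bit in the stored CRC-32 (the trailer is CRC-32 then ISIZE, 4 bytes each).
corrupt = bytearray(good)
corrupt[-8] ^= 0x01

assert gzip.decompress(good) == b"Hello World"

try:
    gzip.decompress(bytes(corrupt))
    detected = False
except OSError as exc:  # gzip.BadGzipFile is a subclass of OSError
    detected = True
    print("rejected:", exc)

print("corruption detected:", detected)
```

A single flipped bit in the trailer is enough for the decoder to reject the stream instead of silently returning data.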
The truly perplexing thing is not why Microsoft assigned someone incompetent to write a gzip compressor and decompressor, but rather why they assigned anyone at all to write it! There is freely available code, zlib, that has an extremely liberal license that permits commercial use with no attribution. This code has been deployed widely for almost two decades, and does all the things it's supposed to do correctly and efficiently. Most everything else uses zlib, including PHP and, I suspect, Fiddler as well.
Partially patents:
http://social.msdn.microsoft.com/Forums/fr-FR/c5f0b53c-a2d5-4407-b43b-9da8d39c01df/why-do-gzipstream-compression-ratio-so-bad?forum=netfxbcl
and
http://challenge-me.ws/post/2010/11/05/Do-Not-Take-Microsofts-Code-for-Granted.aspx