I spent a few days reading the zlib (and gzip and deflate) RFCs, and I can say they are kind of rubbish. Quite a few details are missing, so I'm opening this question.
I'm trying to parse zlib data and I need to know some details about the header.
First of all, the RFC says there will be 2 bytes, CMF and FLG.
CMF is divided into two 4-bit sections. The first one is CM and the second one is CINFO.
What are the possible values of CM? The RFC says that 8 means deflate and that 15 is reserved, but what about the rest of the possible values?
CINFO, on the other hand, should always be 8, if I understand the RFC correctly (please correct me if I'm wrong).
Skipping FLG and the possible FDICT, we get to the Compressed data section. This part of the RFC says:
For compression method 8, the compressed data is stored in the
deflate compressed data format as described in the document
"DEFLATE Compressed Data Format Specification" by L. Peter
Deutsch. (See reference [3] in Chapter 3, below)
What does this mean? Should I assume that CM will always be 8? If yes, then why does the entire CM thing exist?
Last, I'm a bit confused. I always believed zlib could wrap both deflate and gzip, but reading this RFC I can't see where gzip-compressed data fits in here. Is there anything that I'm missing about this?
What are the possible values of CM? The RFC says that 8 means deflate and that 15 is reserved, but what about the rest of the possible values?
...
Should I assume that CM will always be 8? If yes, then why does the entire CM thing exist?
CM is there for future use and to allow other (non-standard) compression methods:
Other compressed data formats are not specified in this version of the zlib specification. (RFC 1950, "ZLIB Compressed Data Format Specification version 3.3")
You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.
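As a rough illustration of that check, here is a minimal sketch in C. It assumes the layout given in RFC 1950, where CM occupies the low 4 bits of the CMF byte. The names `ZLIB_CM_DEFLATE`, `zlib_cm` and `zlib_method_supported` are made up for this example and are not part of the zlib API.

```c
#include <stdint.h>

/* Hypothetical name: 8 is the only compression method RFC 1950 defines. */
#define ZLIB_CM_DEFLATE 8u

/* Extract CM, which RFC 1950 places in bits 0..3 of the CMF byte. */
static unsigned zlib_cm(uint8_t cmf)
{
    return cmf & 0x0Fu;
}

/* Return 1 if the method is one this parser handles, 0 otherwise.
 * A caller would report "not supported" on 0 instead of guessing. */
static int zlib_method_supported(uint8_t cmf)
{
    return zlib_cm(cmf) == ZLIB_CM_DEFLATE;
}
```

For the very common first byte 0x78, `zlib_cm` yields 8, so the stream is accepted; any other CM value should be rejected.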
CINFO, on the other hand, should always be 8, if I understand the RFC correctly (please correct me if I'm wrong).
No, the meaning of CINFO depends on CM. If CM is 8 (the only meaningful standardized value), then:
CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. (RFC 1950, "ZLIB Compressed Data Format Specification version 3.3")
So in fact CINFO can't be 8.
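To make the arithmetic concrete, the window size for CM = 8 works out to 2^(CINFO + 8) bytes. The sketch below assumes CINFO sits in the high 4 bits of CMF, as RFC 1950 describes; `zlib_window_size` is a hypothetical helper name, and returning 0 for invalid input is just a convention chosen for this example.

```c
#include <stdint.h>

/* Return the LZ77 window size in bytes for a CMF byte with CM = 8,
 * or 0 if CM is not 8 or CINFO is above 7 (not allowed by RFC 1950). */
static uint32_t zlib_window_size(uint8_t cmf)
{
    unsigned cm    = cmf & 0x0Fu;  /* bits 0..3 */
    unsigned cinfo = cmf >> 4;     /* bits 4..7 */

    if (cm != 8u || cinfo > 7u)
        return 0;

    /* CINFO = log2(window size) - 8, so window size = 2^(CINFO + 8). */
    return (uint32_t)1 << (cinfo + 8u);
}
```

With CMF = 0x78, CINFO is 7 and the window is 32768 bytes. A CMF of 0x88 would mean CINFO = 8, which the sketch rejects.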
Skipping FLG and the possible FDICT, we get to the Compressed data section. This part of the RFC says:
For compression method 8, the compressed data is stored in the
deflate compressed data format as described in the document
"DEFLATE Compressed Data Format Specification" by L. Peter
Deutsch. (See reference [3] in Chapter 3, below)
What does this mean?
It means that the details of the DEFLATE encoding are not specified in this standard, but are described elsewhere, at ftp://ftp.uu.net/pub/archiving/zip/zlib/.
If you prefer, DEFLATE has its own RFC, that is RFC 1951, "DEFLATE Compressed Data Format Specification version 1.3".
Last, I'm a bit confused. I always believed zlib could wrap both deflate and gzip, but reading this RFC I can't see where gzip-compressed data fits in here. Is there anything that I'm missing about this?
No, zlib can't wrap gzip. gzip and zlib are different wrappers for deflate data (as is the zip format, the PNG format, the PDF format, etc.)
Gzip uses DEFLATE:
The format presently uses the DEFLATE method of compression but can be easily extended to use other compression methods. (RFC 1952, "GZIP file format specification version 4.3")
CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG (RFC 1950, "ZLIB Compressed Data Format Specification version 3.3")
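Since the two wrappers start differently, a parser can usually tell them apart from the first two bytes. The following is only a heuristic sketch: it relies on the gzip magic bytes 0x1F 0x8B from RFC 1952 and on the RFC 1950 rule that CMF*256 + FLG must be a multiple of 31. The enum and the function `guess_wrapper` are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum wrapper_kind { WRAPPER_UNKNOWN, WRAPPER_ZLIB, WRAPPER_GZIP };

/* Guess which wrapper surrounds the deflate data from its first 2 bytes. */
static enum wrapper_kind guess_wrapper(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return WRAPPER_UNKNOWN;

    /* gzip: ID1 = 0x1F, ID2 = 0x8B (RFC 1952). */
    if (buf[0] == 0x1F && buf[1] == 0x8B)
        return WRAPPER_GZIP;

    /* zlib: CM = 8, CINFO <= 7, and the FCHECK bits in FLG make
     * CMF*256 + FLG divisible by 31 (RFC 1950). */
    unsigned cm     = buf[0] & 0x0Fu;
    unsigned cinfo  = buf[0] >> 4;
    unsigned header = ((unsigned)buf[0] << 8) | buf[1];

    if (cm == 8u && cinfo <= 7u && header % 31u == 0u)
        return WRAPPER_ZLIB;

    return WRAPPER_UNKNOWN;
}
```

Raw deflate data has no header at all, so it can land in either the "unknown" bucket or, by coincidence, pass the zlib test. Treat the result as a hint rather than proof.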
If you find the RFC unclear or difficult to understand, consider looking into the source code for an implementation of zlib. While some implementations may be non-standard, looking at the source may help you solve some of your doubts.
Here's an excerpt from the source code of zlib from zlib.net that answers one of your questions:
#define Z_DEFLATED 8
/* ... */
    if (BITS(4) != Z_DEFLATED) {
        strm->msg = (char *)"unknown compression method";
        state->mode = BAD;
        break;
    }