Find gzip start and end?

Posted 2019-05-01 17:23

Question:

I have a file that contains random bytes with multiple gzip files embedded in it. How can I find the start and end of each gzip stream inside that file? There are many random bytes between the gzip streams, so basically I need to find every gzip file and extract it.

Answer 1:

From RFC 1952 (the GZIP file format specification):

Each GZIP file is just a bunch of data chunks (called members), one for each file contained.

Each member starts with the following bytes:

  • 0x1F (ID1)
  • 0x8B (ID2)
  • compression method. 0x08 for a DEFLATEd file. 0-7 are reserved values.
  • flags. The top three bits are reserved and must be zero.
  • (4 bytes) last modified time. May be set to 0.
  • extra flags, defined by the compression method.
  • operating system, actually the file system. 0=FAT, 3=UNIX, 11=NTFS
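The fixed 10-byte header listed above can be used as a cheap plausibility filter before attempting a full decode. Here is a minimal Python sketch; the function name `looks_like_gzip_header` is made up for illustration, and it only checks the magic bytes, the compression method, and the reserved flag bits, so a match is a candidate rather than proof of a gzip member.

```python
import struct

def looks_like_gzip_header(buf: bytes, pos: int = 0) -> bool:
    """Hypothetical helper: check the 10 fixed header bytes at buf[pos:]."""
    if pos < 0 or len(buf) - pos < 10:
        return False
    # ID1, ID2, CM, FLG, MTIME (4 bytes, little-endian), XFL, OS
    id1, id2, method, flags, _mtime, _xfl, _os = struct.unpack_from(
        "<BBBBIBB", buf, pos
    )
    return (
        id1 == 0x1F
        and id2 == 0x8B
        and method == 8            # DEFLATE
        and (flags & 0xE0) == 0    # top three flag bits are reserved
    )
```

Random data will pass this check now and then, so it is best treated as a pre-filter in front of an actual decompression attempt.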

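The end of a member is not marked by a delimiter, and the header carries no length field. Each member ends with an 8-byte trailer (CRC-32 and the uncompressed size modulo 2^32), but the only way to locate it is to decode the compressed data through to its final block. Note that concatenating multiple valid GZIP files creates a valid GZIP file. Also note that feeding a decompressor more bytes than the member contains may still result in a successful read of that member, unless the decompressing library treats trailing bytes as a hard error.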
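One way to "walk the entire member" is to let zlib do it and then ask how many bytes it left unconsumed. Here is a sketch in Python, assuming the whole buffer fits in memory; `gzip_member_end` is a hypothetical name. Passing `wbits=31` makes zlib expect a gzip header and trailer, and `unused_data` holds whatever followed the end of the member.

```python
import zlib

def gzip_member_end(buf: bytes, start: int):
    """Hypothetical helper: offset just past the member at `start`, or None."""
    d = zlib.decompressobj(wbits=31)  # 16 + 15: gzip wrapper, max window
    try:
        d.decompress(buf[start:])
    except zlib.error:
        return None                   # bad header, bad data, or CRC mismatch
    if not d.eof:
        return None                   # input ran out before the member ended
    # Everything zlib did not consume lies after the member's trailer.
    return len(buf) - len(d.unused_data)
```

For very large inputs the same idea works with chunked feeding, at the cost of some bookkeeping to track the consumed offset.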



Answer 2:

Search for a three-byte gzip signature, 0x1f 0x8b 0x08. When you find it, try to decode a gzip stream starting with the 0x1f. If you succeed, then that was a gzip stream, and it ended where it ended. Continue the search from after that gzip stream if it is one, or after the 0x08 if it isn't. Then you will find all of them and you will know their location and span.
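The search-and-try loop described above might look like the following Python sketch. It assumes the file has been read into memory as `bytes`; `find_gzip_streams` and `GZIP_MAGIC` are names chosen here for illustration only.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM = DEFLATE

def find_gzip_streams(buf: bytes):
    """Yield (start, end, payload) for every decodable gzip member in buf."""
    pos = 0
    while True:
        start = buf.find(GZIP_MAGIC, pos)
        if start < 0:
            return
        d = zlib.decompressobj(wbits=31)  # expect gzip header and trailer
        try:
            payload = d.decompress(buf[start:])
            ok = d.eof                    # False if the member is truncated
        except zlib.error:
            ok = False
        if ok:
            end = len(buf) - len(d.unused_data)
            yield start, end, payload
            pos = end                     # resume after the whole member
        else:
            pos = start + len(GZIP_MAGIC) # false hit: resume after the 0x08
```

Usage could be as simple as `for start, end, data in find_gzip_streams(open(path, "rb").read()): ...`, where `path` is a placeholder for the input file.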