Bzip2 block header: 1AY&SY

2019-02-19 06:27发布

This is the question about bzip2 archive format. Any Bzip2 archive consists of file header, one or more blocks and tail structure. All blocks should start with "1AY&SY", 6 bytes of BCD-encoded digits of the Pi number, 0x314159265359. According to the source of bzip2:

/*--
  A 6-byte block header, the value chosen arbitrarily
  as 0x314159265359 :-).  A 32 bit value does not really
  give a strong enough guarantee that the value will not
  appear by chance in the compressed datastream.  Worst-case
  probability of this event, for a 900k block, is about
  2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits.
  For a compressed file of size 100Gb -- about 100000 blocks --
  only a 48-bit marker will do.  NB: normal compression/
  decompression do *not* rely on these statistical properties.
  They are only important when trying to recover blocks from
  damaged files.
--*/

The question is: Is it true, that all bzip2 archives will have blocks with start aligned to byte boundary? I mean all archives created by reference implementation of bzip2, the bzip2-1.0.5+ utility.

I think that bzip2 may parse the stream not as byte stream but as bit stream (the block itself is encoded by huffman, which is not byte-aligned by design).

So, in other words: If grep -c 1AY&SY greater (huffman may generate 1AY&SY inside block) or equal to count of bzip2 blocks in the file?

1条回答
倾城 Initia
2楼-- · 2019-02-19 07:12

BZIP2 looks at a bit stream.

From http://blastedbio.blogspot.com/2011/11/random-access-to-bzip2.html:

Anyway, the important bits are that a BZIP2 file contains one or more "streams", which are byte aligned, each containing one (zero?) or more "blocks", which are not byte aligned, followed by an end of stream marker (the six bytes 0x177245385090 which is the square root of pi as a binary coded decimal (BCD), a four byte checksum, and empty bits for byte alignment).

The bzip2 wikipedia article also alludes to bit-block alignment (see the File Format section), which seems to be inline from what I remember from school (had to implement the algorithm...).

查看更多
登录 后发表回答