get the filesize of very large .gz file on a 64bit

Posted 2019-04-26 21:49

Question:

According to the gzip specification, the uncompressed file size is stored in the last 4 bytes of a .gz file.

I have created 2 files with

dd if=/dev/urandom of=500M bs=1024 count=500000
dd if=/dev/urandom of=5G bs=1024 count=5000000

I gzipped them

gzip 500M 5G

I checked the last 4 bytes of each compressed file with

tail -c4 500M.gz|od -I      (returns 512000000 as expected)
tail -c4 5G.gz|od -I        (returns 825032704, which is not expected)

It seems that crossing the invisible 32-bit barrier turns the value written into ISIZE into complete nonsense, which is more annoying than if they had used some error bit instead.

Does anyone know of a way to get the uncompressed file size from a .gz file without extracting it?

thanks

specification: http://www.gzip.org/zlib/rfc-gzip.html

edit: if anyone wants to try it out, you could use /dev/zero instead of /dev/urandom

Answer 1:

There isn't one.

The only way to get the exact uncompressed size of a compressed stream is to actually decompress it (even if you write everything to /dev/null and just count the bytes).

It's worth noting that ISIZE is defined as

ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.

in the gzip RFC, so it isn't actually breaking at the 32-bit barrier; what you're seeing is expected behavior.
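To make the modulo concrete, here is a minimal sketch that reproduces the number from the question and reads ISIZE byte by byte. The helper name gz_isize is made up for illustration, and it assumes a shell with 64-bit arithmetic (bash on a 64-bit system, for example). Assembling the four bytes by hand avoids depending on the host's byte order, since the RFC stores ISIZE least significant byte first.

```shell
# 5000000 blocks of 1024 bytes, reduced modulo 2^32 (= 4294967296)
expected_isize=$(( (5000000 * 1024) % 4294967296 ))
echo "$expected_isize"

# Hypothetical helper: print the ISIZE field (last 4 bytes) of a .gz file.
gz_isize() {
    # od -An -tu1 prints the four trailing bytes as unsigned decimal numbers;
    # the unquoted substitution splits them into $1..$4.
    set -- $(tail -c 4 "$1" | od -An -tu1)
    # ISIZE is little-endian: the first byte is the least significant.
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

The first echo prints 825032704, the same value the question saw for the 5G file. gz_isize 5G.gz should print that number too, which is correct per the RFC but only equals the real size for inputs smaller than 4 GiB.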



Answer 2:

I haven't tried this with a file of the size you mentioned, but I often find the uncompressed size of a .gz file with

zcat file.gz | wc -c

when I don't want to leave the uncompressed file lying around, or bother to compress it again.

Obviously, the data is uncompressed, but is then piped to wc.

It's worth a try, anyway.

EDIT: When I tried creating a 5G file with data from /dev/random, it produced a file named 5G of size 5120000000 bytes, although my file manager reported this as 4.8G.

Then I compressed it with gzip 5G; the resulting 5G.gz was the same size (random data barely compresses).

Then zcat 5G.gz | wc -c reported the same size as the original file: 5120000000 bytes. So my suggestion seemed to have worked for this trial, anyway.
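The same idea can be wrapped in a small function. This is only a sketch: the name gz_uncompressed_size is hypothetical, and it uses gzip -dc rather than zcat because on some systems zcat only handles .Z files. It still decompresses the whole stream, so it takes as long as a real extraction, but nothing is written to disk.

```shell
# Hypothetical helper: print the uncompressed size in bytes of a gzip file.
gz_uncompressed_size() {
    # gzip -dc streams the decompressed data to stdout and wc -c counts it.
    # The arithmetic expansion strips the padding some wc versions add.
    echo $(( $(gzip -dc "$1" | wc -c) ))
}
```

Usage would be gz_uncompressed_size 5G.gz, which should report the full byte count because the count never passes through the 32-bit ISIZE field.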

Thanks for waiting



Answer 3:

gzip does have a -l option:

       -l --list
          For each compressed file, list the following fields:

              compressed size: size of the compressed file
              uncompressed size: size of the uncompressed file
              ratio: compression ratio (0.0% if unknown)
              uncompressed_name: name of the uncompressed file

          The uncompressed size is given as -1 for files not in gzip format, such as compressed .Z files. To
          get the uncompressed size for such a file, you can use:

              zcat file.Z | wc -c

          In combination with the --verbose option, the following fields are also displayed:

              method: compression method
              crc: the 32-bit CRC of the uncompressed data
              date & time: time stamp for the uncompressed file

          The compression methods currently supported are deflate, compress, lzh (SCO compress -H) and pack.
          The crc is given as ffffffff for a file not in gzip format.

          With --name, the uncompressed name,  date and time  are those stored within the compress  file  if
          present.

          With --verbose, the size totals and compression ratio for all files is also displayed, unless some
          sizes are unknown. With --quiet, the title and totals lines are not displayed.
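For scripting, the uncompressed column can be pulled out of the listing. A minimal sketch, assuming the default (non-verbose) layout shown above with one title line followed by one line per file; gz_listed_size is a made-up name.

```shell
# Hypothetical helper: print the "uncompressed" column of gzip -l for one file.
gz_listed_size() {
    # Line 1 is the title line; line 2 holds the fields for the file,
    # and the second field is the uncompressed size.
    gzip -l "$1" | awk 'NR == 2 { print $2 }'
}
```

One caveat for the original question: as far as I know, older gzip releases take this number straight from ISIZE, so for inputs of 4 GiB or more it is the size modulo 2^32, while newer releases (1.12 and later, I believe) decompress the data to report the true size. It is worth checking gzip --version before trusting the value for very large files.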