The Question
I am parsing large compressed files in Python 2.7.6 and would like to know the uncompressed file size before starting. I am trying to use the second technique presented in this SO answer. It works for bzip2-formatted files but not for gzip-formatted files. What is different about the two compression formats that causes this?
Example Code
This code snippet demonstrates the behavior, assuming you have "test.bz2" and "test.gz" present in your current working directory:
import os
import bz2
import gzip

# bzip2: seeking relative to the end of the file works
bz = bz2.BZ2File('test.bz2', mode='r')
bz.seek(0, os.SEEK_END)
bz.close()

# gzip: the equivalent call raises ValueError
gz = gzip.GzipFile('test.gz', mode='r')
gz.seek(0, os.SEEK_END)
gz.close()
The following traceback is shown:
Traceback (most recent call last):
File "zip_test.py", line 10, in <module>
gz.seek(0, os.SEEK_END)
File "/usr/lib64/python2.6/gzip.py", line 420, in seek
raise ValueError('Seek from end not supported')
ValueError: Seek from end not supported
Why does this work for *.bz2 files but not *.gz files?
In simple terms, gzip is a stream compressor: each compressed element depends on the ones before it, and the uncompressed size is not known until the whole stream has been decoded. Seeking from the end would therefore be pointless, because the whole file would have to be decompressed anyway. The authors of gzip.py presumably decided it was better to raise an error than to silently decompress the entire file, so the user realizes that such a seek is inefficient.
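That said, if all you need is the uncompressed size, the gzip format does store it: the last four bytes of the file (the ISIZE field) hold the uncompressed length modulo 2**32. A minimal sketch, assuming a single-member gzip file smaller than 4 GiB (the file name here is made up for the demo):

```python
import gzip
import os
import struct

# Create a small gzip file to demonstrate with.
data = b'x' * 100000
with gzip.open('demo.gz', 'wb') as f:
    f.write(data)

# Read the trailing 4-byte ISIZE field: the uncompressed size
# modulo 2**32, valid only for single-member gzip files.
with open('demo.gz', 'rb') as f:
    f.seek(-4, os.SEEK_END)
    isize = struct.unpack('<I', f.read(4))[0]

print(isize)  # 100000
os.remove('demo.gz')
```

Note this is unreliable for multi-member files or files over 4 GiB, since ISIZE wraps around.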
On the other hand, bzip2 is a block compressor: each block is compressed independently.
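This is why the BZ2File seek in your snippet succeeds. A small self-contained sketch (note that the implementation may still decompress the whole stream internally to find the end, so this can be slow on large files):

```python
import bz2
import os

# Create a small bzip2 file to demonstrate with.
data = b'y' * 50000
with bz2.BZ2File('demo.bz2', 'wb') as f:
    f.write(data)

# BZ2File supports SEEK_END; after seeking to the end, tell()
# reports the uncompressed size.
f = bz2.BZ2File('demo.bz2', 'rb')
f.seek(0, os.SEEK_END)
size = f.tell()
f.close()

print(size)  # 50000
os.remove('demo.bz2')
```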
If you really want random access to a gzipped file, write a wrapper that decompresses the contents and returns a buffer that supports seeking. Unfortunately, that defeats the optimisation mentioned in the link from your question.
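Such a wrapper can be as simple as decompressing into an in-memory buffer; the helper name below is hypothetical, and the cost is holding the full uncompressed data in memory:

```python
import gzip
import io
import os

# Create a small gzip file to demonstrate with.
data = b'z' * 10000
with gzip.open('demo2.gz', 'wb') as f:
    f.write(data)

def seekable_gzip(path):
    """Decompress the whole gzip file into a BytesIO buffer,
    which supports arbitrary seeking (including SEEK_END)."""
    with gzip.open(path, 'rb') as f:
        return io.BytesIO(f.read())

buf = seekable_gzip('demo2.gz')
buf.seek(0, os.SEEK_END)
print(buf.tell())  # 10000
os.remove('demo2.gz')
```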