Why is seeking from the end of a file allowed for

Posted 2019-05-03 12:29

The Question

I am parsing large compressed files in Python 2.7.6 and would like to know the uncompressed file size before starting. I am trying to use the second technique presented in this SO answer. It works for bzip2 formatted files but not gzip formatted files. What is different about the two compression algorithms that causes this?

Example Code

This code snippet demonstrates the behavior, assuming you have "test.bz2" and "test.gz" present in your current working directory:

import os
import bz2
import gzip

bz = bz2.BZ2File('test.bz2', mode='r')
bz.seek(0, os.SEEK_END)
bz.close()

gz = gzip.GzipFile('test.gz', mode='r')
gz.seek(0, os.SEEK_END)
gz.close()

The following traceback is shown:

Traceback (most recent call last):
  File "zip_test.py", line 10, in <module>
    gz.seek(0, os.SEEK_END)
  File "/usr/lib64/python2.6/gzip.py", line 420, in seek
    raise ValueError('Seek from end not supported')
ValueError: Seek from end not supported

Why does this work for *.bz2 files but not *.gz files?

1 Answer
我只想做你的唯一
Post #2 · 2019-05-03 12:58

In simple terms, gzip is a stream compressor, which means that each compressed element depends on the previous one. Seeking would be pointless, because the whole file would have to be decompressed anyway. The authors of gzip.py probably decided it was better to raise an error than to decompress the file silently, so that the user realizes that seeking is inefficient.

bzip2, on the other hand, is a block compressor: each block is independent.

If you really want random access to a gzipped file, then write a wrapper which decompresses the contents and returns a buffer which offers seeking. Unfortunately that would defeat the optimisation which is mentioned in the link from your question.
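A minimal sketch of such a wrapper, assuming the uncompressed data fits in memory. The helper name open_gzip_seekable is hypothetical and not part of the gzip module; it simply reads everything through gzip.open and hands back an io.BytesIO, which supports seeking from the end:

```python
import gzip
import io
import os


def open_gzip_seekable(path):
    """Decompress all of `path` into memory and return a seekable buffer.

    Hypothetical helper for illustration: the whole file is decompressed
    up front, so this costs as much time and memory as reading it fully.
    """
    with gzip.open(path, 'rb') as gz:
        data = gz.read()
    return io.BytesIO(data)


# Usage, assuming "test.gz" exists in the current working directory:
# buf = open_gzip_seekable('test.gz')
# buf.seek(0, os.SEEK_END)      # allowed on the in-memory buffer
# uncompressed_size = buf.tell()
# buf.seek(0)                   # rewind before parsing
```

Because the data is decompressed before the first seek, this gives the uncompressed size but no speed-up over simply reading the file once.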
