Python cannot read “warc.gz” file completely

Posted 2019-04-16 23:46

Question:

For my work, I scrape websites and write them to gzipped web archives (with the extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.

I noticed that I cannot read the majority of these files completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.

After some research I found out that this problem happens only with the gzipped files. The files with extension "warc" do not have this problem.

Other people have run into this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution has been found so far.

My guess is that there might be a bug in "gzip" in Python 2.7.11. Does anyone have experience with this and know what can be done about it?

Thanks in advance!

Example:

I create new warc.gz files like this:

import warc
# Raw string, so that "\f" is not interpreted as a form-feed escape
warc_path = r"\\some_path\file_name.warc.gz"
warc_file = warc.open(warc_path, "wb")

To write records I use:

record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)

This creates valid "warc.gz" files without any problems. Everything, including the "\r\n" line endings, is correct. The problem starts when I read these files.

To read files I use:

warc_file = warc.open(warc_path, "rb")

To loop through records I use:

for record in warc_file:
    ...

The problem is that this loop does not find all records in a "warc.gz" file, while it finds all of them in a "warc" file. The warc library itself is supposed to handle both types of files.

Answer 1:

It seems that the combination of custom gzip handling in warc.gzip2.GzipFile, file splitting with warc.utils.FilePart, and reading in warc.warc.WARCReader is broken as a whole (tested with Python 2.7.9, 2.7.10 and 2.7.11). The reader stops short when it receives no data instead of a new header.
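To illustrate what "a new header" means here: a file made of concatenated gzip members can be walked member by member with the standard zlib module. The following is only a minimal sketch of that idea, not the warc library's actual code. The helper names make_member and iter_members and the sample data are made up for illustration, and it assumes each record is stored as its own gzip member.

```python
import gzip
import io
import zlib


def make_member(data):
    """Compress data into one standalone gzip member (hypothetical helper)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


def iter_members(raw):
    """Yield the decompressed bytes of each gzip member found in raw."""
    while raw:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        data = decomp.decompress(raw) + decomp.flush()
        yield data
        # Whatever follows the end of this member is the next member.
        raw = decomp.unused_data


# Made-up sample data standing in for three records.
records = [b"record one\r\n", b"record two\r\n", b"record three\r\n"]
archive = b"".join(make_member(r) for r in records)

members = list(iter_members(archive))
print(len(members))
```

A reader that treats one empty read as end-of-file, instead of checking for a further member, would stop early on such a file, which matches the symptom described in the question.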

The basic stdlib gzip module appears to handle the concatenated files just fine, so this should work as well:

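import gzip
import warc

# Let the stdlib gzip module do the decompression and hand the
# decompressed stream to warc, bypassing warc's own gzip handling.
with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        # Single-argument print(...) behaves the same on Python 2.7
        print(record.payload.read())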
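As a rough sanity check, the number of records that stdlib gzip can see may be compared against the number the warc library yields. The sketch below only counts lines starting with "WARC/" in the decompressed stream, so it can overcount if a payload happens to contain such a line. The function names and the simplified fake records are assumptions for illustration; the records are not complete WARC records.

```python
import gzip
import io


def count_warc_headers(fileobj):
    """Count lines starting with b"WARC/" in a gzipped stream (rough check)."""
    count = 0
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        for line in gz:
            if line.startswith(b"WARC/"):
                count += 1
    return count


def fake_record(n):
    """Build a simplified, hypothetical record; not a complete WARC record."""
    body = ("payload %d" % n).encode("ascii")
    head = ("WARC/1.0\r\nContent-Length: %d\r\n\r\n" % len(body)).encode("ascii")
    return head + body + b"\r\n\r\n"


# Write five records, each as its own gzip member, into one buffer.
buf = io.BytesIO()
for n in range(5):
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(fake_record(n))

buf.seek(0)
total = count_warc_headers(buf)
print(total)
```

For a real file, the same function could be called on a file object opened in binary mode, and the result compared with the number of iterations of the warc loop.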


Tags: python gzip warc