Python cannot read “warc.gz” file completely

For my work, I scrape websites and write them to gzipped web archives (with the extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.

I noticed that I cannot read the majority of these files completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.

After some research I found out that this problem happens only with the gzipped files. The files with extension "warc" do not have this problem.

I have found that other people have this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution has been found so far.

I suspect there might be a bug in "gzip" in Python 2.7.11. Does anyone have experience with this and know what can be done about it?

Thanks in advance!

Example:

I create new warc.gz files like this:

import warc
warc_path = "\\some_path\\file_name.warc.gz"  # backslash escaped so "\f" is not parsed as a form feed
warc_file = warc.open(warc_path, "wb")

To write records I use:

record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)

This creates valid "warc.gz" files with no apparent problems. Everything, including the "\r\n" line endings, is correct. The problem starts when I read these files.

To read files I use:

warc_file = warc.open(warc_path, "rb")

To loop through records I use:

for record in warc_file:
    ...

The problem is that the loop does not find all records in a "warc.gz" file, while it finds all of them in a "warc" file. The warc library itself is supposed to handle both types of files.

Tags: python gzip warc

1 answer

小情绪 Triste *

#2 · 2019-04-17 00:08

It seems that the combination of custom gzip handling in warc.gzip2.GzipFile, file splitting with warc.utils.FilePart and reading in warc.warc.WARCReader is broken as a whole (tested with Python 2.7.9, 2.7.10 and 2.7.11). The reader stops early when it receives no data where it expects the next record header.

The standard library's gzip module, by contrast, seems to handle the concatenated gzip members just fine, so this should work as well:
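A rough way to confirm how many records a file really holds, independent of the warc library, is to decompress it with the standard library and count the lines that begin a record. This is only a sketch: the function name count_warc_records is made up for illustration, and it assumes that every record starts with a version line beginning with "WARC/" and that no payload line happens to start with the same bytes, so treat the result as an estimate.

```python
import gzip


def count_warc_records(path):
    """Estimate the number of WARC records in a gzipped file.

    Assumption: each record starts with a version line such as
    "WARC/1.0". A payload line that starts with the same bytes
    would be counted too, so this is an approximation.
    """
    count = 0
    # Stdlib gzip keeps reading across concatenated gzip members.
    with gzip.open(path, "rb") as gzf:
        for line in gzf:
            if line.startswith(b"WARC/"):
                count += 1
    return count
```

If this count matches the expected number of records (517 in the question's example) while looping with warc.open yields fewer, the truncation most likely happens in the library's reader rather than in the file.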

import gzip
import warc

with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()
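The claim that stdlib gzip copes with concatenated gzip data can be checked without the warc library. The following is a minimal sketch with made-up helper names (make_multi_member_gzip, read_all) and dummy byte strings standing in for records. It compresses each chunk as a separate gzip member, joins them in one buffer, and reads everything back in a single pass.

```python
import gzip
import io


def make_multi_member_gzip(chunks):
    """Compress each chunk as its own gzip member and concatenate them."""
    buf = io.BytesIO()
    for chunk in chunks:
        # Each GzipFile context writes one complete member
        # (header, compressed data, trailer) and leaves buf open.
        with gzip.GzipFile(fileobj=buf, mode="wb") as member:
            member.write(chunk)
    return buf.getvalue()


def read_all(data):
    """Decompress a possibly multi-member gzip byte string in one go."""
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as gzf:
        return gzf.read()


# Dummy stand-ins for records; real WARC records would go here.
chunks = [b"record one\r\n", b"record two\r\n", b"record three\r\n"]
blob = make_multi_member_gzip(chunks)
restored = read_all(blob)
```

Here restored should equal the three chunks joined together, which shows the reader continuing past the end of the first member instead of stopping there.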
