Checking tarfile integrity in Python

Posted 2019-03-02 15:45

Question:

I'm working on converting my backup script from shell to Python. One of the features of my old script was to check the created tarfile for integrity by doing: gzip -t .

This seems to be a bit tricky in Python.

It seems that the only way to do this is to read each of the compressed TarInfo objects within the tarfile.

Is there a way to check a tarfile for integrity without extracting it to disk or keeping it in memory (in its entirety)?

Good people on #python on freenode suggested that I should read each TarInfo object chunk-by-chunk, discarding each chunk read.

I must admit that I have no idea how to do this, since I have only just started with Python.

Imagine that I have a tarfile of 30 GB which contains files ranging from 1 KB to 10 GB...

This is the solution that I started writing:

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()

This code is far from finished. I would not dare run this on a huge 30 GB tar archive, because at some point check would be an object of 10+ GB (if I have such huge files within the tar archive).

Bonus: I tried manually corrupting zero.tar.gz (hex editor - edit a few bytes midfile). The first except does not catch IOError... Here is the output:

Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for member_info in tardude.getmembers():
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L

Answer 1:

Just a minor improvement on Aya's answer to make things a little more idiomatic (although I'm removing some of the error checking to make the mechanics more visible):

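import tarfile

BLOCK_SIZE = 1024

with tarfile.open("zero.tar.gz") as tardude:
    for member in tardude.getmembers():
        target = tardude.extractfile(member)
        if target is None:  # directories and similar members have no data
            continue
        # Read and discard each chunk; the read itself surfaces any errors.
        for chunk in iter(lambda: target.read(BLOCK_SIZE), b''):
            pass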

This really just removes the while 1: (sometimes considered a minor code smell) and the if not data: check. Also note that the use of with restricts this to Python 2.7+.



Answer 2:

I tried manually corrupting zero.tar.gz (hex editor - edit a few bytes midfile). The first except does not catch IOError...

If you look at the traceback, you'll see it's being thrown when you call tardude.getmembers(), so you'll need something like...

import sys
import tarfile

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print("There was an error opening tarfile. The file might be corrupt or missing.")
    sys.exit(1)  # without an open archive there is nothing left to check

try:
    members = tardude.getmembers()
except:
    print("There was an error reading tarfile members.")
    sys.exit(1)

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print("File: %r is corrupt." % member_info.name)

tardude.close()

As for the original problem, you're almost there. You just need to read the data from your check object with something like...

import sys
import tarfile

BLOCK_SIZE = 1024

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print("There was an error opening tarfile. The file might be corrupt or missing.")
    sys.exit(1)  # without an open archive there is nothing left to check

try:
    members = tardude.getmembers()
except:
    print("There was an error reading tarfile members.")
    sys.exit(1)

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
        if check is None:  # directories and similar members have no data
            continue
        while 1:
            data = check.read(BLOCK_SIZE)
            if not data:
                break
    except:
        print("File: %r is corrupt." % member_info.name)

tardude.close()

...which should ensure you never use more than BLOCK_SIZE bytes of memory at a time.
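To see the chunked read in isolation, here is a minimal sketch written for Python 3. The helper names drain_archive and build_demo_archive, and the small demo archive built in a temporary directory, are assumptions made for illustration and are not part of the code above. Each regular file is read block by block and only the byte count is kept, so just one block of file data is held at a time.

```python
import io
import os
import tarfile
import tempfile

BLOCK_SIZE = 1024


def drain_archive(path, block_size=BLOCK_SIZE):
    """Read every regular file in the archive block by block.

    Returns a dict mapping member name to the number of bytes read.
    (Hypothetical helper, shown only to illustrate the loop.)
    """
    sizes = {}
    with tarfile.open(path) as tar:
        for member in tar.getmembers():
            source = tar.extractfile(member)
            if source is None:  # directories and similar members have no data
                continue
            total = 0
            while True:
                data = source.read(block_size)
                if not data:
                    break
                total += len(data)  # the block itself is discarded
            sizes[member.name] = total
    return sizes


def build_demo_archive(path):
    """Write a small .tar.gz with two files so the sketch can run anywhere."""
    with tarfile.open(path, "w:gz") as tar:
        for name, payload in (("a.txt", b"x" * 5000), ("b.bin", b"\0" * 300)):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.tar.gz")
        build_demo_archive(demo)
        print(drain_archive(demo))
```

Raising block_size trades a little memory for fewer read calls; the peak amount of file data held stays at about one block either way.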

Also, you should try to avoid using...

try:
    do_something()
except:
    do_something_else()

...because it will mask unexpected exceptions. Try to only catch the exception you actually intend to handle, like...

try:
    do_something()
except IOError:
    do_something_else()

...otherwise you'll find it more difficult to detect bugs in your code.
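Applied to the archive check, that advice could look like the sketch below (Python 3). It assumes that damage shows up as tarfile.TarError, OSError (IOError is an alias for it in Python 3), EOFError for a gzip stream that ends early, or zlib.error. The helper name check_archive is hypothetical, and the list of exceptions may not be exhaustive.

```python
import tarfile
import zlib

BLOCK_SIZE = 1024

# Assumed set of exceptions that signal a damaged or unreadable archive.
CORRUPTION_ERRORS = (tarfile.TarError, OSError, EOFError, zlib.error)


def check_archive(path, block_size=BLOCK_SIZE):
    """Return (True, None) if every member could be read, else (False, reason)."""
    try:
        with tarfile.open(path) as tar:
            for member in tar.getmembers():
                source = tar.extractfile(member)
                if source is None:  # no data to read for this member
                    continue
                while source.read(block_size):
                    pass  # discard the data
    except CORRUPTION_ERRORS as exc:
        return False, "%s: %s" % (type(exc).__name__, exc)
    return True, None
```

Note that OSError also covers a missing file, so the returned reason is worth logging. Anything outside this tuple, such as a NameError from a typo, still propagates instead of being reported as corruption.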



Answer 3:

You can use the subprocess module to call gzip -t on the file...

from subprocess import call
import os

with open(os.devnull, 'w') as bb:
    result = call(['gzip', '-t', "zero.tar.gz"], stdout=bb, stderr=bb)

If result is not 0, something is amiss. You might want to check if gzip is available, though. I wrote a utility function for that:

import subprocess
import sys
import os

def checkfor(args, rv = 0):
    """Make sure that a program necessary for using this script is
    available.

    Arguments:
    args  -- string or list of strings of commands. A single string may
             not contain spaces.
    rv    -- expected return value from invoking the command.
    """
    if isinstance(args, str):
        if ' ' in args:
            raise ValueError('no spaces in single command allowed')
        args = [args]
    try:
        with open(os.devnull, 'w') as bb:
            rc = subprocess.call(args, stdout=bb, stderr=bb)
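        if rc != rv:
            # Give the error a message; a bare OSError has strerror None.
            raise OSError(0, 'unexpected return value {}'.format(rc))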
    except OSError as oops:
        outs = "Required program '{}' not found: {}."
        print(outs.format(args[0], oops.strerror))
        sys.exit(1)
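If calling an external program is not an option, a rough pure-Python counterpart to gzip -t is to decompress the stream with the standard gzip module and throw the output away. This is a sketch under assumptions: gzip_stream_ok is a made-up name, the code targets Python 3, and it only checks the compressed stream, not the tar structure inside it.

```python
import gzip
import zlib

CHUNK = 64 * 1024


def gzip_stream_ok(path, chunk=CHUNK):
    """Decompress the whole file in chunks, discarding the output.

    Returns True if the stream could be read to the end, False otherwise.
    """
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk):
                pass  # nothing is kept, so memory use stays near one chunk
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC or length mismatches,
        # EOFError covers a stream that ends early.
        return False
    return True
```

Because the whole stream is read to the end, the trailing CRC is compared as well, which is the part a hex-edited file tends to fail.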