I'm working on converting my backup script from shell to Python. One of the features of my old script was to check the created tarfile for integrity by running gzip -t.
This seems to be a bit tricky in Python. Apparently the only way to do it is by reading each of the compressed TarInfo objects within the tarfile.
Is there a way to check a tarfile for integrity, without extracting to disk, or keeping it in memory (in its entirety)?
Good people on #python on freenode suggested that I should read each TarInfo object chunk-by-chunk, discarding each chunk read.
I must admit that I have no idea how to do this, as I've only just started with Python.
Imagine that I have a tarfile of 30GB which contains files ranging from 1KB to 10GB...
This is the solution that I started writing:
import tarfile

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()
This code is far from finished. I would not dare run this on a huge 30GB tar archive, because at one point check would be an object of 10+GB (if I have such huge files within the tar archive).
Bonus:
I tried manually corrupting zero.tar.gz (with a hex editor, editing a few bytes mid-file). The first except does not catch the IOError... Here is the output:
Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for member_info in tardude.getmembers():
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L
Just a minor improvement on Aya's answer to make things a little more idiomatic (although I'm removing some of the error checking to make the mechanics more visible):
import tarfile

BLOCK_SIZE = 1024

with tarfile.open("zero.tar.gz") as tardude:
    for member in tardude.getmembers():
        if not member.isfile():
            continue  # extractfile() returns None for non-file members
        with tardude.extractfile(member.name) as target:
            for chunk in iter(lambda: target.read(BLOCK_SIZE), b''):
                pass
This really just removes the while 1: loop (sometimes considered a minor code smell) and the if not data: check. Also note that the use of with restricts this to Python 2.7+.
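For the curious, the two-argument form of iter() calls its first argument repeatedly and stops as soon as it returns the sentinel (here, the empty byte string). A tiny self-contained illustration using an in-memory buffer instead of a tar member:

```python
import io

# iter(callable, sentinel) yields callable() results until the
# sentinel value comes back -- exactly the read-until-EOF pattern.
buf = io.BytesIO(b"abcdefgh")
chunks = list(iter(lambda: buf.read(3), b''))
print(chunks)  # [b'abc', b'def', b'gh']
```

Each chunk is at most 3 bytes here, just as each read in the loop above is at most BLOCK_SIZE bytes.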
I tried manually corrupting zero.tar.gz (hex editor - edit a few bytes
midfile). The first except does not catch IOError...
If you look at the traceback, you'll see it's being thrown when you call tardude.getmembers()
, so you'll need something like...
try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

try:
    members = tardude.getmembers()
except:
    print "There was an error reading tarfile members."

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()
As for the original problem, you're almost there. You just need to read the data from your check
object with something like...
BLOCK_SIZE = 1024

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

try:
    members = tardude.getmembers()
except:
    print "There was an error reading tarfile members."

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
        while 1:
            data = check.read(BLOCK_SIZE)
            if not data:
                break
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()
...which should ensure you never use more than BLOCK_SIZE
bytes of memory at a time.
Also, you should try to avoid using...
try:
    do_something()
except:
    do_something_else()
...because it will mask unexpected exceptions. Try to only catch the exception you actually intend to handle, like...
try:
    do_something()
except IOError:
    do_something_else()
...otherwise you'll find it more difficult to detect bugs in your code.
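To see the masking in action, here is a small self-contained sketch (buggy and the verdict variables are made-up names): the bare except wrongly blames I/O for what is really a NameError, while the narrow except lets the real bug surface.

```python
def buggy():
    return undefined_name  # NameError: a genuine bug, not an I/O failure

# A bare except swallows everything, the bug included...
try:
    buggy()
    verdict_bare = "ok"
except:
    verdict_bare = "blamed I/O"   # wrong diagnosis, bug stays hidden

# ...whereas catching only IOError lets the real error escape.
try:
    try:
        buggy()
    except IOError:
        pass
except NameError:
    verdict_narrow = "bug surfaced"

print(verdict_bare, verdict_narrow)
```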
You can use the subprocess
module to call gzip -t
on the file...
from subprocess import call
import os

with open(os.devnull, 'w') as bb:
    result = call(['gzip', '-t', "zero.tar.gz"], stdout=bb, stderr=bb)
If result is not 0, something is amiss. You might want to check that gzip is available first, though. I wrote a utility function for that:
import subprocess
import sys
import os

def checkfor(args, rv=0):
    """Make sure that a program necessary for using this script is
    available.

    Arguments:
    args -- string or list of strings of commands. A single string may
            not contain spaces.
    rv   -- expected return value from invoking the command.
    """
    if isinstance(args, str):
        if ' ' in args:
            raise ValueError('no spaces in single command allowed')
        args = [args]
    try:
        with open(os.devnull, 'w') as bb:
            rc = subprocess.call(args, stdout=bb, stderr=bb)
        if rc != rv:
            raise OSError
    except OSError as oops:
        outs = "Required program '{}' not found: {}."
        print(outs.format(args[0], oops.strerror))
        sys.exit(1)