I'm working on converting my backup script from shell to Python. One of the features of my old script was to check the created tarfile for integrity by doing: gzip -t .
This seems to be a bit tricky in Python.
It seems that the only way to do this is by reading each of the compressed TarInfo objects within the tarfile.
Is there a way to check a tarfile for integrity, without extracting it to disk or keeping it in memory (in its entirety)?
Good people on #python on freenode suggested that I should read each TarInfo object chunk-by-chunk, discarding each chunk read.
I must admit that I have no idea how to do this, seeing that I just started Python.
Imagine that I have a tarfile of 30GB which contains files ranging from 1kb to 10GB...
This is the solution that I started writing:
import tarfile

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()
This code is far from finished. I would not dare run this on a huge 30GB tar archive, because at one point check would be an object of 10+ GB (if I have such huge files within the tar archive).
Bonus: I tried manually corrupting zero.tar.gz (hex editor: edited a few bytes mid-file). The first except does not catch the resulting IOError... Here is the output:
Traceback (most recent call last):
File "./test.py", line 31, in <module>
for member_info in tardude.getmembers():
File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
self._load() # all members, we first have to
File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
tarinfo = self.next()
File "/usr/lib/python2.7/tarfile.py", line 2315, in next
self.fileobj.seek(self.offset)
File "/usr/lib/python2.7/gzip.py", line 429, in seek
self.read(1024)
File "/usr/lib/python2.7/gzip.py", line 256, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 320, in _read
self._read_eof()
File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L
You can use the subprocess module to call gzip -t on the file... If result is not 0, something is amiss. You might want to check that gzip is available first, though; I wrote a utility function for that.

If you look at the traceback, you'll see the exception is being thrown when you call tardude.getmembers(), so you'll need something like...

As for the original problem, you're almost there. You just need to read the data from your check object with something like... ...which should ensure you never use more than BLOCK_SIZE bytes of memory at a time.

Also, you should try to avoid using a bare except:... ...because it will mask unexpected exceptions. Try to only catch the exception you actually intend to handle, like except IOError:... ...otherwise you'll find it more difficult to detect bugs in your code.
Just a minor improvement on Aya's answer to make things a little more idiomatic (although I'm removing some of the error checking to make the mechanics more visible):
This really just removes the while 1: loop (sometimes considered a minor code smell) and the if not data: check. Also note that the use of with restricts this to Python 2.7+.