This code is simplification of code in a Django app that receives an uploaded zip file via HTTP multi-part POST and does read-only processing of the data inside:
#!/usr/bin/env python
import csv, sys, StringIO, traceback, zipfile
try:
import io
except ImportError:
sys.stderr.write('Could not import the `io` module.\n')
def get_zip_file(filename, method):
if method == 'direct':
return zipfile.ZipFile(filename)
elif method == 'StringIO':
data = file(filename).read()
return zipfile.ZipFile(StringIO.StringIO(data))
elif method == 'BytesIO':
data = file(filename).read()
return zipfile.ZipFile(io.BytesIO(data))
def process_zip_file(filename, method, open_defaults_file):
zip_file = get_zip_file(filename, method)
items_file = zip_file.open('items.csv')
csv_file = csv.DictReader(items_file)
try:
for idx, row in enumerate(csv_file):
image_filename = row['image1']
if open_defaults_file:
z = zip_file.open('defaults.csv')
z.close()
sys.stdout.write('Processed %d items.\n' % idx)
except zipfile.BadZipfile:
sys.stderr.write('Processing failed on item %d\n\n%s'
% (idx, traceback.format_exc()))
process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))
Pretty simple. We open the zip file and one or two CSV files inside the zip file.
What's weird is that if I run this with a large zip file (~13 MB) and have it instantiate the ZipFile
from a StringIO.StringIO
or a io.BytesIO
(Perhaps anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile
from a TemporaryUploadedFile
or even a file object created by calling os.tmpfile()
and shutil.copyfileobj()
) and have it open TWO csv files rather than just one, then it fails towards the end of processing. Here's the output that I see on a Linux system:
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.
Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of the BadZipfile
exception, it seems to read corrupted data and gets very confused.
This all suggests to me that I am doing something in this code that you are not supposed to do -- e.g.: call zipfile.open
on a file while already having another file within the same zip file object open? This doesn't seem to be a problem when using ZipFile(filename)
, but perhaps it's problematic when passing ZipFile
a file-like object, because of some implementation details in the zipfile
module?
Perhaps I missed something in the zipfile
docs? Or maybe it's not documented yet? Or (least likely), a bug in the zipfile
module?