We have ZIP files that are 5-10 GB in size. The typical ZIP file has 5-10 internal files, each 1-5 GB in size uncompressed.
I have a nice set of Python tools for reading these files. Basically, I can open a filename, and if it refers to a ZIP file, the tools search inside the archive and open the compressed member. It's all rather transparent.
I want to store these files in Amazon S3 as compressed files. S3 supports fetching byte ranges, so it should be possible to fetch the ZIP central directory (it's at the end of the file, so I can just read the last 64 KiB), find the component I want, download just that, and stream it directly to the calling process.
So my question is, how do I do that through the standard Python ZipFile API? It isn't documented how to replace the filesystem transport with an arbitrary object that supports POSIX semantics. Is this possible without rewriting the module?
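To make the goal concrete, here is the kind of substitution I have in mind, sketched with an in-memory buffer standing in for the eventual S3-backed object (the member name is made up):

```python
import io
import zipfile

# Goal: replace this in-memory buffer with an object backed by ranged
# S3 reads that supports the same read()/seek()/tell() calls.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("component.dat", b"hello")   # stand-in for a 1-5 GB member

with zipfile.ZipFile(buf) as zf:             # buf plays the role of the S3 object
    print(zf.namelist())                     # ['component.dat']
    print(zf.read("component.dat"))          # b'hello'
```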
So here is the code that allows you to open a file on Amazon S3 as if it were a normal file. Notice that I use the `aws` command rather than the `boto3` Python module (I don't have access to boto3). You can open the file and seek on it. The file is cached locally. If you open the file with the Python ZipFile API and it's a ZipFile, you can then read individual parts. You can't write, though, because S3 doesn't support partial writes.

Separately, I implement `s3open()`, which can open a file for reading or writing, but it doesn't implement the seek interface, which is required by `ZipFile`.
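The caching and seek machinery itself isn't shown above, but a minimal sketch of the idea, using only the `aws s3api head-object` and `aws s3api get-object --range` subcommands (no boto3), might look like this. The local caching mentioned above is omitted for brevity, and every read is fetched into a throwaway temporary file:

```python
import io
import json
import os
import subprocess
import tempfile


class S3CliFile(io.RawIOBase):
    """Read-only, seekable file-like object backed by the `aws` CLI.
    Each read() fetches just the requested byte range with a ranged GET."""

    def __init__(self, bucket, key):
        self.bucket = bucket
        self.key = key
        self.pos = 0
        # head-object reports the object size, which seek(..., SEEK_END) needs
        head = subprocess.run(
            ["aws", "s3api", "head-object", "--bucket", bucket, "--key", key],
            check=True, capture_output=True)
        self.length = json.loads(head.stdout)["ContentLength"]

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.length + offset
        return self.pos

    def read(self, size=-1):
        if size < 0:
            size = self.length - self.pos
        if size <= 0 or self.pos >= self.length:
            return b""
        end = min(self.pos + size, self.length) - 1
        with tempfile.TemporaryDirectory() as tmpdir:
            outfile = os.path.join(tmpdir, "chunk")
            # Ranged GET of bytes [pos, end] written to a temporary file
            subprocess.run(
                ["aws", "s3api", "get-object",
                 "--bucket", self.bucket, "--key", self.key,
                 "--range", "bytes=%d-%d" % (self.pos, end), outfile],
                check=True, capture_output=True)
            with open(outfile, "rb") as f:
                data = f.read()
        self.pos += len(data)
        return data
```

With something like that in place, `zipfile.ZipFile(S3CliFile("my-bucket", "big-archive.zip"))` reads the central directory through a handful of small ranged GETs rather than by downloading the whole archive (the bucket and key names here are placeholders).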
Here's an approach which does not need to fetch the entire file (full version available here). It does require `boto` (or `boto3`), though (unless you can mimic the ranged `GET`s via the AWS CLI, which I guess is quite possible as well).

In your case you might need to write the fetched content to a local file (due to the large size), unless memory usage is not a concern.
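The full version isn't reproduced here, but a minimal sketch of the same idea with `boto3` might look like the following. The bucket name, key, and output filename are placeholders, and the final loop streams the selected member to a local file so the multi-GB uncompressed data never has to sit in memory:

```python
import io
import zipfile

import boto3


class S3RangeReader(io.RawIOBase):
    """Seekable, read-only wrapper that turns seek()/read() calls into
    ranged GETs, so ZipFile only downloads the bytes it actually touches."""

    def __init__(self, bucket, key):
        self.s3 = boto3.client("s3")
        self.bucket, self.key = bucket, key
        self.length = self.s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        self.pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.length + offset
        return self.pos

    def read(self, size=-1):
        if size < 0:
            size = self.length - self.pos
        if size <= 0 or self.pos >= self.length:
            return b""
        end = min(self.pos + size, self.length) - 1
        resp = self.s3.get_object(Bucket=self.bucket, Key=self.key,
                                  Range="bytes=%d-%d" % (self.pos, end))
        data = resp["Body"].read()
        self.pos += len(data)
        return data


if __name__ == "__main__":
    reader = S3RangeReader("my-bucket", "big-archive.zip")   # placeholder names
    with zipfile.ZipFile(reader) as zf:   # constructor fetches only the central directory
        name = zf.namelist()[0]
        # Stream one member to disk in 1 MiB chunks to keep memory bounded
        with zf.open(name) as src, open("component.out", "wb") as dst:
            for chunk in iter(lambda: src.read(1 << 20), b""):
                dst.write(chunk)
```

An obvious refinement is to cache reads in larger blocks (for example, fetch the last 64 KiB once and serve ZipFile's several small end-of-archive reads from that buffer), so each seek()/read() pair doesn't become its own HTTP request.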