I want to allow users to download an archive of multiple large files at once. However, the files and the archive may be too large to store in memory or on disk on my server (they are streamed in from other servers on the fly). I'd like to generate the archive as I stream it to the user.
I can use Tar or Zip or whatever is simplest. I am using django, which allows me to return a generator or file-like object in my response. This object could be used to pump the process along. However, I am having trouble figuring out how to build this sort of thing around the zipfile or tarfile libraries, and I'm afraid they may not support reading files as they go, or reading the archive as it is built.
This answer on converting an iterator to a file-like object might help. tarfile.TarFile.addfile takes a file-like object, but it appears to immediately pass that to shutil.copyfileobj (and it needs the member's size up front), so this may not be as generator-friendly as I had hoped.
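As a sketch of that iterator-to-file-like idea (the class and names here are my own, not from any library): a minimal adapter that exposes an iterable of byte chunks as a readable stream, which shutil.copyfileobj can consume. The catch for tarfile remains that the tar header needs the member's size before the data is written:

```python
import io
import tarfile

class IterableReader(io.RawIOBase):
    """Hypothetical adapter: expose an iterable of byte chunks as a
    readable file-like object (what shutil.copyfileobj expects)."""
    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._buf = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Pull chunks until we can satisfy the request or run out of data.
        while len(self._buf) < len(b):
            try:
                self._buf += next(self._iter)
            except StopIteration:
                break
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n  # 0 signals EOF

# Usage: works only when the total length is known before streaming starts,
# because addfile copies exactly info.size bytes.
buf = io.BytesIO()
info = tarfile.TarInfo("greeting.txt")
info.size = 11
with tarfile.open(fileobj=buf, mode="w") as tar:
    tar.addfile(info, IterableReader([b"hello ", b"world"]))
```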
You can stream a ZipFile to a Pylons or Django response fileobj by wrapping the fileobj in something file-like that implements tell(). This will buffer each individual file in the zip in memory, but stream the zip itself. We use it to stream-download a zip file full of images, so we never buffer more than a single image in memory. This example streams to sys.stdout. For Pylons use response.body_file; for Django you can use the HttpResponse itself as a file.

You can do it by generating and streaming a zip file with no compression, which basically means just adding the headers before each file's content. You're right that the libraries don't support this, but you can hack around them to get it working.
This code wraps zipfile.ZipFile with a class that manages the stream and creates instances of zipfile.ZipInfo for the files as they come. CRC and size can be set at the end. You can push data from the input stream into it with put_file(), write() and flush(), and read data out of it to the output stream with read().
Keep in mind that this code was just a quick proof of concept, and I did no further development or testing once I decided to let the HTTP server itself deal with the problem. A few things you should look into if you decide to use it are whether nested folders are archived correctly, and filename encoding (which is always a pain with zip files anyway).
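For what it's worth, on Python 3.6+ the standard library largely obviates this kind of hack: ZipFile.open(name, mode="w") returns a writable member handle, so each incoming chunk can be written as it arrives without buffering the whole file. A sketch, with the chunk source as a stand-in for data streamed from another server:

```python
import io
import zipfile

def remote_chunks():
    # Stand-in for data streamed in from another server.
    yield b"part one "
    yield b"part two"

buf = io.BytesIO()  # in real use this would be the outgoing response stream
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    # Python 3.6+: open a member for writing and push chunks into it.
    with zf.open("remote.txt", mode="w") as member:
        for chunk in remote_chunks():
            member.write(chunk)
```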
Here is the solution from Pedro Werneck (above), but with a fix to avoid collecting all data in memory (the read method is fixed a little bit).

You can then use the stream_generator function as a stream for a zip file. Example for Falcon:
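For comparison, here is a self-contained generator in the same spirit (my own sketch — stream_generator and the Falcon wiring from the original answer are not reproduced here). It feeds ZipFile through a write()/tell()-only sink, which recent Python 3 treats as unseekable, and yields each chunk as soon as it is produced; a Falcon handler could set resp.stream = zip_stream(...), and Django could pass the generator to StreamingHttpResponse:

```python
import zipfile

def zip_stream(files):
    """Hypothetical generator: `files` yields (name, chunk_iterable) pairs;
    the zip archive is yielded incrementally as it is built."""
    pending = []

    class Sink:  # write()/tell()-only: zipfile treats it as unseekable
        def __init__(self):
            self._pos = 0
        def write(self, data):
            pending.append(bytes(data))
            self._pos += len(data)
            return len(data)
        def tell(self):
            return self._pos
        def flush(self):
            pass

    with zipfile.ZipFile(Sink(), "w", zipfile.ZIP_DEFLATED) as zf:
        for name, chunks in files:
            # Python 3.6+: write one member in pieces as data arrives.
            with zf.open(name, mode="w") as member:
                for chunk in chunks:
                    member.write(chunk)
                    yield from pending
                    pending.clear()
            yield from pending
            pending.clear()
    # Flush the central directory and end-of-archive record.
    yield from pending
```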
I ended up using SpiderOak ZipStream.