I'm trying to figure out the best way to compress a stream with Python's zlib.
I've got a file-like input stream (input, below) and an output function which accepts a file-like (output_function, below):
with open("file") as input:
    output_function(input)
And I'd like to gzip-compress input chunks before sending them to output_function:
with open("file") as input:
    output_function(gzip_stream(input))
It looks like the gzip module assumes that either the input or the output will be a gzip'd file-on-disk… So I assume that the zlib module is what I want.
However, it doesn't natively offer a simple way to create a stream file-like… And the stream-compression it does support comes by way of manually adding data to a compression buffer, then flushing that buffer.
Of course, I could write a wrapper around zlib.Compress.compress and zlib.Compress.flush (Compress is returned by zlib.compressobj()), but I'd be worried about getting buffer sizes wrong, or something similar.
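For reference, such a wrapper might look roughly like the sketch below. It assumes Python 3 and an input stream opened in binary mode; gzip_chunks and chunk_size are hypothetical names, not part of zlib. The chunk size only bounds how much input is read at a time, and the result is a generator of compressed chunks rather than a file-like.

```python
import io
import zlib


def gzip_chunks(stream, chunk_size=64 * 1024):
    """Yield gzip-compressed chunks read from a binary file-like (hypothetical helper)."""
    # wbits = 16 + MAX_WBITS asks zlib to emit a gzip header and trailer
    # instead of the default zlib container.
    compressor = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = compressor.compress(chunk)
        if data:  # compress() may return b"" while zlib buffers internally
            yield data
    # flush() emits whatever is still buffered plus the gzip trailer.
    yield compressor.flush()


# Usage: any binary file-like works as the input stream.
source = io.BytesIO(b"example data " * 10000)
compressed = b"".join(gzip_chunks(source, chunk_size=4096))
```

Only one input chunk and its compressed output are held at a time, so neither side has to fit in memory as long as the consumer processes the chunks as they arrive.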
So, what's the simplest way to create a streaming, gzip-compressing file-like with Python?
Edit: To clarify, the input stream and the compressed output stream are both too large to fit in memory, so something like output_function(StringIO(zlib.compress(input.read()))) doesn't really solve the problem.
Use the cStringIO (or StringIO) module in conjunction with zlib:
The gzip module supports compressing to a file-like object: pass a fileobj parameter to GzipFile, as well as a filename. The filename you pass in doesn't need to exist; it is only used to fill in the filename field of the gzip header.
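A minimal sketch of that call, assuming Python 3; the io.BytesIO destination and the name "data.txt" are placeholders chosen for illustration. Note that it only shows the writing direction: compressed bytes end up in the destination object, which is not the same as getting a readable compressed stream to hand to output_function.

```python
import gzip
import io

# Any writable binary file-like can serve as the destination.
buf = io.BytesIO()

# "data.txt" is only recorded in the gzip header; no such file has to exist.
with gzip.GzipFile(filename="data.txt", mode="wb", fileobj=buf) as gz:
    gz.write(b"some data to compress\n" * 100)

# Closing the GzipFile does not close buf, so the result is still available.
compressed = buf.getvalue()
```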
Update
This answer does not work. Example:
output:
Here is a cleaner, non-self-referencing version based on Ricardo Cárdenes' very helpful answer.
Advantages:
It's quite kludgy (self-referencing, etc.; I only spent a few minutes writing it, so nothing really elegant), but it does what you want if you're still interested in using gzip instead of zlib directly.

Basically, GzipWrap is a (very limited) file-like object that produces a gzipped file out of a given iterable (e.g., a file-like object, a list of strings, any generator...).

Of course, it produces binary, so there was no sense in implementing "readline".
You should be able to expand it to cover other cases or to be used as an iterable object itself.
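A minimal sketch of a file-like along these lines, assuming Python 3 and an iterable of bytes. GzipStream is a hypothetical name, and this is an illustration of the idea rather than the GzipWrap code referred to above: it lets GzipFile write into an in-memory sink and drains that sink on each read.

```python
import gzip
import io


class GzipStream:
    """Hypothetical read-only file-like that lazily gzips an iterable of bytes."""

    def __init__(self, chunks, filename=""):
        self._chunks = iter(chunks)
        self._buffer = bytearray()  # compressed bytes not yet handed out
        self._sink = io.BytesIO()   # GzipFile writes its output here
        self._gz = gzip.GzipFile(filename=filename, mode="wb", fileobj=self._sink)
        self._done = False

    def _fill(self, size):
        # Pull input until enough compressed bytes are buffered or input ends.
        while not self._done and (size < 0 or len(self._buffer) < size):
            chunk = next(self._chunks, None)
            if chunk is None:
                self._gz.close()  # flushes remaining data and the gzip trailer
                self._done = True
            else:
                self._gz.write(chunk)
            # Move whatever GzipFile has emitted so far into our buffer.
            self._buffer += self._sink.getvalue()
            self._sink.seek(0)
            self._sink.truncate()

    def read(self, size=-1):
        self._fill(size)
        if size < 0:
            size = len(self._buffer)
        data = bytes(self._buffer[:size])
        del self._buffer[:size]
        return data


# Usage: a binary file object is itself an iterable of bytes lines.
stream = GzipStream([b"hello ", b"world\n"] * 1000)
first = stream.read(64)
rest = stream.read()
```

Memory use is bounded by the requested read size plus one input chunk, provided the consumer calls read with a size rather than reading everything at once.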