This is related to the question about zip bombs, but with gzip or bzip2 compression in mind, e.g. a web service accepting .tar.gz files.
Python provides a handy tarfile module that is convenient to use, but it does not seem to provide protection against zip bombs.
In Python code using the tarfile module, what would be the most elegant way to detect zip bombs, preferably without duplicating too much logic (e.g. the transparent decompression support) from the tarfile module?
And, just to make it a bit less simple: No real files are involved; the input is a file-like object (provided by the web framework, representing the file a user uploaded).
You could use the resource module to limit the resources available to your process and its children.
If you need to decompress in memory then you could set resource.RLIMIT_AS (or RLIMIT_DATA, RLIMIT_STACK), e.g. using a context manager to automatically restore it to a previous value:
import contextlib
import resource

@contextlib.contextmanager
def limit(limit, type=resource.RLIMIT_AS):
    soft_limit, hard_limit = resource.getrlimit(type)
    resource.setrlimit(type, (limit, hard_limit))  # set soft limit
    try:
        yield
    finally:
        resource.setrlimit(type, (soft_limit, hard_limit))  # restore

with limit(1 << 30):  # 1GB
    pass  # do the thing that might try to consume all memory
If the limit is reached, MemoryError is raised.
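For example, a minimal sketch of guarding tarfile processing this way (the file name and the 1 GB cap are example values; in the web-service case the uploaded file-like object would be passed as fileobj):

import tarfile

uploaded = open("upload.tar.gz", "rb")  # stand-in for the uploaded file object
try:
    with limit(1 << 30):  # same soft cap as above
        tar = tarfile.open(fileobj=uploaded, mode="r:gz")
        for member in tar:
            f = tar.extractfile(member)
            if f:  # None for directories etc.
                data = f.read()  # in-memory decompression; a bomb raises MemoryError
except MemoryError:
    print "rejected: archive needs too much memory to process"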
This will determine the uncompressed size of the gzip stream, while using limited memory:
#!/usr/bin/python
import sys
import zlib

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)  # 15 window bits + 16: expect a gzip wrapper
total = 0
while True:
    buf = z.unconsumed_tail
    if buf == "":
        buf = f.read(1024)
        if buf == "":
            break
    got = z.decompress(buf, 4096)  # cap output at 4096 bytes per call
    if got == "":
        break
    total += len(got)
print total
if z.unused_data != "" or f.read(1024) != "":
    print "warning: more input after end of gzip stream"
It will return a slight overestimate of the space required for all of the files in the tar file when extracted. The length includes those files, as well as the tar directory information.
The gzip.py code does not control the amount of data decompressed, except by virtue of the size of the input data. In gzip.py, it reads 1024 compressed bytes at a time. So you can use gzip.py if you're ok with up to about 1056768 bytes of memory usage for the uncompressed data (1032 * 1024, where 1032:1 is the maximum compression ratio of deflate). The solution here uses zlib.decompress with the second argument, which limits the amount of uncompressed data. gzip.py does not.
This will accurately determine the total size of the extracted tar entries by decoding the tar format:
#!/usr/bin/python
import sys
import zlib

def decompn(f, z, n):
    """Return n uncompressed bytes, or fewer if at the end of the compressed
    stream.  This only decompresses as much as necessary, in order to
    avoid excessive memory usage for highly compressed input.
    """
    blk = ""
    while len(blk) < n:
        buf = z.unconsumed_tail
        if buf == "":
            buf = f.read(1024)
        got = z.decompress(buf, n - len(blk))
        blk += got
        if got == "":
            break
    return blk

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)
total = 0
left = 0
while True:
    blk = decompn(f, z, 512)  # tar works in 512-byte blocks
    if len(blk) < 512:
        break
    if left == 0:
        if blk == "\0" * 512:
            continue
        if blk[156] in ["1", "2", "3", "4", "5", "6"]:
            continue  # entry types that carry no data blocks
        if ord(blk[124]) == 0x80:  # base-256 (binary) size field
            size = 0
            for i in range(125, 136):
                size <<= 8
                size += ord(blk[i])
        else:  # octal size field
            size = int(blk[124:136].split()[0].split("\0")[0], 8)
        if blk[156] not in ["x", "g", "X", "L", "K"]:
            total += size
        left = (size + 511) // 512
    else:
        left -= 1
print total
if blk != "":
    print "warning: partial final block"
if left != 0:
    print "warning: tar file ended in the middle of an entry"
if z.unused_data != "" or f.read(1024) != "":
    print "warning: more input after end of gzip stream"
You could use a variant of this to scan the tar file for bombs. This has the advantage of finding a large size in the header information before you even have to decompress that data.
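A minimal sketch of such a variant, reusing decompn and the zlib import from the script above (the maxsize threshold and the choice of exception are my own):

def check_tar_gz(f, maxsize):
    """Raise an error as soon as the sizes claimed in the tar headers
    exceed maxsize, without decompressing the file data itself."""
    z = zlib.decompressobj(15 + 16)
    total = 0
    left = 0
    while True:
        blk = decompn(f, z, 512)
        if len(blk) < 512:
            break
        if left == 0:
            if blk == "\0" * 512:
                continue
            if blk[156] in ["1", "2", "3", "4", "5", "6"]:
                continue  # entry types that carry no data blocks
            if ord(blk[124]) == 0x80:  # base-256 (binary) size field
                size = 0
                for i in range(125, 136):
                    size = (size << 8) + ord(blk[i])
            else:  # octal size field
                size = int(blk[124:136].split()[0].split("\0")[0], 8)
            if blk[156] not in ["x", "g", "X", "L", "K"]:
                total += size
                if total > maxsize:
                    raise IOError("probable tar bomb: %d bytes and counting" % total)
            left = (size + 511) // 512
        else:
            left -= 1
    return total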
As for .tar.bz2 archives, the Python bz2 library (at least as of 3.3) is unavoidably unsafe for bz2 bombs consuming too much memory. The bz2.decompress function does not offer a second argument like zlib.decompress does. This is made even worse by the fact that the bz2 format has a much, much higher maximum compression ratio than zlib due to run-length coding. bzip2 compresses 1 GB of zeros to 722 bytes. So you cannot meter the output of bz2.decompress by metering the input, as can be done with zlib.decompress even without the second argument. The lack of a limit on the decompressed output size is a fundamental flaw in the Python interface.
I looked in the _bz2module.c in 3.3 to see if there is an undocumented way to use it to avoid this problem. There is no way around it. The decompress function in there just keeps growing the result buffer until it can decompress all of the provided input. _bz2module.c needs to be fixed.
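For what it's worth, Python 3.5 later added a max_length argument to bz2.BZ2Decompressor.decompress, which does make metered bz2 decompression possible. A minimal Python 3 sketch (the 1024/4096 chunk sizes are arbitrary choices):

import bz2

def bounded_bz2_size(f, maxsize):
    """Count the uncompressed size, never holding more than 4096
    decompressed bytes at a time (requires Python 3.5+)."""
    d = bz2.BZ2Decompressor()
    total = 0
    while not d.eof:
        if d.needs_input:
            chunk = f.read(1024)
            if not chunk:
                break  # truncated stream
        else:
            chunk = b""  # output was capped; drain before feeding more input
        total += len(d.decompress(chunk, max_length=4096))
        if total > maxsize:
            raise IOError("probable bzip2 bomb: %d bytes and counting" % total)
    return total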
If you develop for Linux, you can run the decompression in a separate process and use ulimit to limit the memory usage.
import subprocess

# LIMIT is the memory cap for ulimit -v (in kilobytes) and FILE the path of
# the uploaded archive; both are placeholders.
subprocess.Popen("ulimit -v %d; ./decompression_script.py %s" % (LIMIT, FILE),
                 shell=True)  # shell=True so the shell applies the ulimit
Keep in mind that decompression_script.py should decompress the whole file in memory before writing to disk.
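A minimal sketch of what decompression_script.py could look like under that constraint (the gzip format and the output path are my assumptions):

#!/usr/bin/python
import gzip
import sys

# Decompress fully in memory first, so a bomb hits the ulimit and the
# process dies before anything lands on disk.
data = gzip.open(sys.argv[1], "rb").read()
with open(sys.argv[1] + ".out", "wb") as out:
    out.write(data)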
I guess the answer is: there is no easy, ready-made solution. Here is what I use now:
class SafeUncompressor(object):
    """Small proxy class that enables external file object
    support for uncompressed, bzip2 and gzip files. Works transparently, and
    supports a maximum size to avoid zipbombs.
    """
    blocksize = 16 * 1024

    class FileTooLarge(Exception):
        pass

    def __init__(self, fileobj, maxsize=10*1024*1024):
        self.fileobj = fileobj
        self.name = getattr(self.fileobj, "name", None)
        self.maxsize = maxsize
        self.init()

    def init(self):
        import bz2
        import gzip
        self.pos = 0
        self.fileobj.seek(0)
        self.buf = ""
        self.format = "plain"

        magic = self.fileobj.read(2)
        if magic == '\037\213':
            self.format = "gzip"
            self.fileobj.seek(0)  # GzipFile reads the header itself
            self.gzipobj = gzip.GzipFile(fileobj=self.fileobj, mode='r')
        elif magic == 'BZ':
            raise IOError, "bzip2 support in SafeUncompressor disabled, as self.bz2obj.decompress is not safe"
            # unreachable while bz2 support is disabled:
            self.format = "bz2"
            self.bz2obj = bz2.BZ2Decompressor()
            self.fileobj.seek(0)

    def read(self, size):
        b = [self.buf]
        x = len(self.buf)
        while x < size:
            if self.format == 'gzip':
                data = self.gzipobj.read(self.blocksize)
                if not data:
                    break
            elif self.format == 'bz2':
                raw = self.fileobj.read(self.blocksize)
                if not raw:
                    break
                # this can already bomb here, to some extent,
                # so disable bzip support until resolved.
                # Also monitor http://stackoverflow.com/questions/13622706/how-to-protect-myself-from-a-gzip-or-bzip2-bomb for ideas
                data = self.bz2obj.decompress(raw)
            else:
                data = self.fileobj.read(self.blocksize)
                if not data:
                    break
            b.append(data)
            x += len(data)

            if self.pos + x > self.maxsize:
                self.buf = ""
                self.pos = 0
                raise SafeUncompressor.FileTooLarge, "Compressed file too large"
        self.buf = "".join(b)

        buf = self.buf[:size]
        self.buf = self.buf[size:]
        self.pos += len(buf)
        return buf

    def seek(self, pos, whence=0):
        if whence != 0:
            raise IOError, "SafeUncompressor only supports whence=0"
        if pos < self.pos:
            self.init()
        self.read(pos - self.pos)

    def tell(self):
        return self.pos
It does not work well for bzip2, so that part of the code is disabled. The reason is that bz2.BZ2Decompressor.decompress can already produce an unwanted large chunk of data.
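For completeness, a usage sketch with tarfile (the file name and size cap are example values):

import tarfile

upload = open("upload.tar.gz", "rb")  # stand-in for the uploaded file object
safe = SafeUncompressor(upload, maxsize=100 * 1024 * 1024)
try:
    # SafeUncompressor already decompresses, so open the tar as plain.
    tar = tarfile.open(fileobj=safe, mode="r:")
    names = tar.getnames()
except SafeUncompressor.FileTooLarge:
    print "rejected: uncompressed data exceeds the limit"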
I also need to handle zip bombs in uploaded zipfiles.
I do this by creating a fixed-size tmpfs and unzipping to that. If the extracted data is too large then the tmpfs will run out of space and give an error.
Here are the Linux commands to create a 200M tmpfs to unzip to.
sudo mkdir -p /mnt/ziptmpfs
echo 'tmpfs /mnt/ziptmpfs tmpfs rw,nodev,nosuid,size=200M 0 0' | sudo tee -a /etc/fstab
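The fstab entry only registers the mount point; mount it once by hand for the current session (afterwards it comes back on every boot):

sudo mount /mnt/ziptmpfs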