How to write a large amount of data in a tarfile i

Posted 2020-07-19 02:37

Question:

I've written a small cryptographic module in Python whose task is to encrypt a file and put the result in a tarfile. The original file to encrypt can be quite large, but that's not a problem because my program only needs to work with a small block of data at a time, which can be encrypted on the fly and stored.

I'm looking for a way to avoid doing it in two passes: first writing all the data to a temporary file, then inserting the result in a tarfile.

Basically I do the following (where generator_encryptor is a simple generator that yields chunks of data read from sourcefile):

import tarfile

t = tarfile.open("target.tar", "w")
tmp = open('content', 'wb')
for chunk in generator_encryptor("sourcefile"):
    tmp.write(chunk)
tmp.close()
t.add('content')
t.close()

I'm a bit annoyed at having to use a temporary file, as I feel it should be easy to write blocks directly into the tar file. But collecting all the chunks in a single string and using something like t.addfile('content', StringIO(bigcipheredstring)) seems to be ruled out, because I can't guarantee that I have enough memory to hold bigcipheredstring.

Any hints on how to do that?

Answer 1:

You can create your own file-like object and pass it to TarFile.addfile. Your file-like object will generate the encrypted contents on the fly in its read() method.
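One way this idea could look in Python 3 is sketched below. The names IterStream and add_stream are made up for the illustration, and the sketch assumes the total encrypted size is known before writing, since addfile copies exactly TarInfo.size bytes. The raw stream only implements readinto; wrapping it in io.BufferedReader should take care of assembling full-sized reads from chunks of arbitrary length.

```python
import io
import tarfile


class IterStream(io.RawIOBase):
    """Hypothetical helper: expose an iterable of bytes chunks as a raw stream."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill from the iterator, skipping empty chunks; 0 signals EOF.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def add_stream(tar, name, chunks, size):
    """Add a member called `name` whose data comes from `chunks`.

    `size` must be the exact total length of the data (assumed known here).
    """
    info = tarfile.TarInfo(name)
    info.size = size
    # BufferedReader.read(n) keeps calling readinto until it has n bytes or EOF.
    tar.addfile(info, io.BufferedReader(IterStream(chunks)))
```

With the generator from the question, usage would be along the lines of add_stream(t, "content", generator_encryptor("sourcefile"), total_size), where total_size is whatever your cipher lets you compute in advance.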



Answer 2:

Huh? Can't you just use the subprocess module to run a pipe through to tar? That way, no temporary file should be needed. Of course, this won't work if you can't generate your data in small enough chunks to fit in RAM, but if you have that problem, then tar isn't the issue.



Answer 3:

Basically, using a file-like object and passing it to TarFile.addfile does the trick, but there are still some open issues.

  • I need to know the full encrypted file size at the beginning
  • tarfile calls the read method in such a way that the custom file-like object must always return full read buffers; otherwise tarfile assumes it has reached end of file. This leads to some really inefficient buffer copying in the read method, but it's either that or modifying the tarfile module.

The resulting code is below. Basically, I had to write a wrapper class that transforms my existing generator into a file-like object. I also added the GeneratorEncryptor class to my example to make the code complete. Note that it has a __len__ method that returns the length of the written file (but keep in mind that this class is just a dummy placeholder that does nothing useful).

import tarfile

class GeneratorEncryptor(object):
    """Dummy class for testing purposes

       The real one performs on-the-fly encryption of the source file
    """
    def __init__(self, source):
        self.source = source
        self.BLOCKSIZE = 1024
        self.NBBLOCKS = 1000

    def __call__(self):
        # Yield NBBLOCKS blocks of BLOCKSIZE bytes each (dummy "ciphertext").
        for c in range(self.NBBLOCKS):
            yield str(c % 10).encode('ascii') * self.BLOCKSIZE

    def __len__(self):
        return self.BLOCKSIZE * self.NBBLOCKS

class GeneratorToFile(object):
    """Wrap a data generator in a minimal file-like object (read() only)
    """
    def __init__(self, generator):
        self.buf = b''
        self.generator = generator()

    def read(self, size):
        # tarfile treats a short read as end of file, so keep pulling
        # chunks until `size` bytes are available or the generator ends.
        chunk = self.buf
        while len(chunk) < size:
            try:
                chunk = chunk + next(self.generator)
            except StopIteration:
                self.buf = b''
                return chunk
        self.buf = chunk[size:]
        return chunk[:size]

generator = GeneratorEncryptor("source")
t = tarfile.open("target.tar", "w")
# Build the member header by hand; its size must be set before writing.
ti = tarfile.TarInfo(name="content")
ti.size = len(generator)
t.addfile(ti, fileobj=GeneratorToFile(generator))
t.close()


Answer 4:

I guess you need to understand how the tar format works, and handle the tar writing yourself. Maybe this can be helpful?

http://mail.python.org/pipermail/python-list/2001-August/100796.html
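As a rough illustration of what handling the tar writing yourself could involve, here is a sketch (not taken from the linked post) that writes a single-member archive without knowing the data size in advance: it writes a placeholder header, streams the data, then seeks back and rewrites the header with the real size. The function name write_streamed_tar is made up, and the sketch assumes a seekable output file and a member name short enough to fit in one 512-byte header block.

```python
import tarfile


def write_streamed_tar(tar_path, member_name, chunks):
    """Write a one-member tar archive from an iterable of bytes chunks.

    Returns the number of data bytes written. Hypothetical helper.
    """
    info = tarfile.TarInfo(member_name)
    with open(tar_path, "wb") as out:
        header_pos = out.tell()
        header = info.tobuf(format=tarfile.GNU_FORMAT)
        # Assumption: a short name, so the header is exactly one block.
        assert len(header) == tarfile.BLOCKSIZE
        out.write(header)  # placeholder header with size 0

        size = 0
        for chunk in chunks:
            out.write(chunk)
            size += len(chunk)

        # Member data is padded with NUL bytes to a multiple of 512 bytes.
        out.write(b"\0" * (-size % tarfile.BLOCKSIZE))
        # Two all-zero blocks mark the end of the archive.
        out.write(b"\0" * (2 * tarfile.BLOCKSIZE))

        # Go back and overwrite the placeholder with the real size.
        info.size = size
        out.seek(header_pos)
        out.write(info.tobuf(format=tarfile.GNU_FORMAT))
    return size
```

The trade-off is that the output must be a regular seekable file, so this would not work when writing to a pipe or through a compressed stream.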



Tags: python tar