How to find the origin of data?

Posted 2019-08-01 02:44

Question:

So far the files are just being downloaded individually like the following rather than all being in one zipped file:

s3client = boto3.client('s3')

s3client.download_file('firstbucket', obj['Key'], filename)

Answer 1:

Let me save you some trouble by using AWS CLI:

aws s3 cp s3://mybucket/mydir/ . --recursive ; zip myzip.zip *.csv

You can change the wildcard to suit your needs. This will likely be faster than a one-file-at-a-time Python loop, because the AWS CLI transfers multiple files in parallel by default.



Answer 2:

If you want to use boto, you'll have to do it in a loop like you have and add each item to a zip file.

With the CLI you can use s3 sync and then zip up the result: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

aws s3 sync s3://bucket-name ./local-location && zip -r bucket.zip ./local-location
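
If you would rather do the zip step from Python after the sync has finished, something like the following could work. It is only a sketch: `zip_directory` is a hypothetical helper name, and it relies on the standard library's `shutil.make_archive`. Note that entries are stored relative to the source directory, so unlike the `zip` command above they are not prefixed with `local-location/`.

```python
import shutil


def zip_directory(src_dir, archive_base):
    """Zip everything under src_dir, including subdirectories.

    archive_base is the output path WITHOUT the '.zip' suffix.
    Returns the path of the archive that was created.
    """
    # Keep the archive outside src_dir so it does not end up inside itself.
    return shutil.make_archive(archive_base, 'zip', root_dir=src_dir)
```

For example, `zip_directory('./local-location', 'bucket')` would produce `bucket.zip` in the current directory.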



Answer 3:

It looks like you're really close, but you need to pass a file name to ZipFile.write() and download_file does not return a file name. The following should work alright, but I haven't tested it exhaustively.

from tempfile import NamedTemporaryFile
from zipfile import ZipFile

import boto3


def archive_bucket(bucket_name, zip_name):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')

    with ZipFile(zip_name, 'w') as zf:
        for page in paginator.paginate(Bucket=bucket_name):
            # 'Contents' is missing from a page that has no objects
            # (for example, an empty bucket), so default to an empty list.
            for obj in page.get('Contents', []):
                # This might have issues on some systems since the file will
                # be open for writes in two places. You can use other
                # methods of creating a temporary file to work around that.
                with NamedTemporaryFile() as f:
                    s3.download_file(bucket_name, obj['Key'], f.name)
                    # Copies over the temporary file using the key as the
                    # file name in the zip.
                    zf.write(f.name, obj['Key'])

This uses less disk space than the CLI-based solutions, but it still isn't ideal. At some point there will be two copies of a given file: one in the temp file and one that has been zipped up. So you need enough disk space for the total size of all the files you're downloading plus the size of the largest of those files. If there were a way to open a file-like object that wrote directly to a file inside the zip archive, you could get around that. I don't know how to do that, however.
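
For what it's worth, Python 3.6 and later do offer that kind of file-like object: `ZipFile.open(name, 'w')` returns a writable stream for a single archive member. The following is a minimal sketch of how that could remove the temporary file, not a tested production solution. The function name `stream_bucket_to_zip` and the choice to pass the client in as the `s3` parameter are assumptions made for this illustration; `s3` is expected to be a boto3 S3 client such as `boto3.client('s3')`. It reads each object through `get_object` and copies the response body into the archive in chunks.

```python
import shutil
from zipfile import ZIP_DEFLATED, ZipFile


def stream_bucket_to_zip(s3, bucket_name, zip_name):
    """Copy every object in a bucket into a zip without temp files.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    paginator = s3.get_paginator('list_objects_v2')

    with ZipFile(zip_name, 'w', compression=ZIP_DEFLATED) as zf:
        for page in paginator.paginate(Bucket=bucket_name):
            # 'Contents' is absent when a page holds no objects.
            for obj in page.get('Contents', []):
                key = obj['Key']
                if key.endswith('/'):
                    # Skip zero-byte "folder" placeholder keys.
                    continue
                body = s3.get_object(Bucket=bucket_name, Key=key)['Body']
                # The member size is not known up front, so allow
                # ZIP64 in case an object is larger than 2 GiB.
                with zf.open(key, 'w', force_zip64=True) as dest:
                    # Streams the body in chunks straight into the zip.
                    shutil.copyfileobj(body, dest)
```

A call might look like `stream_bucket_to_zip(boto3.client('s3'), 'firstbucket', 'bucket.zip')`. Since each object is streamed directly into the archive, the only thing on disk is the zip itself. The trade-off is that objects are fetched one at a time over a single stream, so this may be slower than `download_file`.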