Boto3 to download all files from an S3 bucket

Posted 2019-01-07 05:04

Question:

I'm using boto3 to get files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is

#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket contains only files. If a folder is present inside the bucket, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this the proper way to download a complete S3 bucket using boto3? How do I download folders?

Answer 1:

I had the same need and wrote the following function, which downloads the files recursively. Directories are created locally only if they contain files.

import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        # Recurse into "subdirectories" (common prefixes)
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        # Download the files under the current prefix
        if result.get('Contents') is not None:
            for file in result.get('Contents'):
                if not os.path.exists(os.path.dirname(local + os.sep + file.get('Key'))):
                    os.makedirs(os.path.dirname(local + os.sep + file.get('Key')))
                resource.meta.client.download_file(bucket, file.get('Key'), local + os.sep + file.get('Key'))

The function is called like this:

def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp')


Answer 2:

Amazon S3 does not have folders/directories. It is a flat file structure.

To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

  • images/foo.jpg

In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

You could either truncate the filename to save only the .8Df54234 portion, or you would have to create the necessary directories before writing files. Note that these could be multi-level nested directories.
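For illustration, a minimal sketch of that second approach (creating any nested directories before each download) might look like this, assuming Python 3 for makedirs(..., exist_ok=True) and using the bucket name from the question as a placeholder:

import os
import boto3

s3 = boto3.client('s3')
bucket = 'my_bucket_name'  # placeholder bucket name from the question

for obj in s3.list_objects_v2(Bucket=bucket).get('Contents', []):
    key = obj['Key']
    if key.endswith('/'):
        # zero-byte "folder" placeholder objects have nothing to download
        continue
    local_path = os.path.join('.', key)
    # create the (possibly multi-level) directories before writing the file
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(bucket, key, local_path)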

An easier way would be to use the AWS Command-Line Interface (CLI), which will do all this work for you, e.g.:

aws s3 cp --recursive s3://my_bucket_name local_folder

There's also a sync option that will only copy new and modified files.
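For example, the sync equivalent of the command above would be:

aws s3 sync s3://my_bucket_name local_folder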



Answer 3:

import os
import boto3

# instantiate the S3 resource
s3 = boto3.resource('s3')

# select the bucket
my_bucket = s3.Bucket('my_bucket_name')

# download each file into the current directory
for s3_object in my_bucket.objects.all():
    # Split the key into path and file name, otherwise download_file
    # fails with "file not found" when the key contains a path.
    path, filename = os.path.split(s3_object.key)
    my_bucket.download_file(s3_object.key, filename)


Answer 4:

I'm currently achieving the task by using the following:

#!/usr/bin/python
import os
import boto3

s3 = boto3.client('s3')
contents = s3.list_objects(Bucket='bucket')['Contents']
for s3_key in contents:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3.download_file('bucket', s3_object, s3_object)
    else:
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)

Although it does the job, I'm not sure it's a good way to do this. I'm leaving it here in the hope that it helps other users and prompts further answers with a better way of achieving it.



Answer 5:

Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you might end up hitting Python's recursion limits. Here's an alternative approach, with a couple of extra checks.

import os
import errno
import boto3


def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Recursively downloads the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        for key in result.get('Contents', []):
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)


client = boto3.client('s3')

download_dir(client, 'bucket-name', 'path/to/data', 'downloads')


Answer 6:

It is a very bad idea to get all files in one go; you should rather fetch them in batches.

One implementation which I use to fetch a particular folder (directory) from S3 is:

from boto3.session import Session

def get_directory(directory_path, download_path, exclude_file_names):
    # prepare session (credentials, region and bucket_name are assumed to be defined elsewhere)
    session = Session(aws_access_key_id, aws_secret_access_key, region_name=region_name)

    # get instances for client, resource and bucket
    client = session.client('s3')
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)

    for s3_key in client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
        s3_object = s3_key['Key']
        if s3_object not in exclude_file_names:
            bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))
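
If the prefix holds many objects, a paginator is one way to fetch the listing in fixed-size batches. A minimal sketch, with the bucket name and prefix as placeholders rather than values from the answer above:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# fetch keys in pages of 100 at a time
pages = paginator.paginate(Bucket='bucket_name',
                           Prefix='path/to/data',
                           PaginationConfig={'PageSize': 100})
for page in pages:
    for obj in page.get('Contents', []):
        print(obj['Key'])  # process / download each batch of keys here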

And if you still want to get the whole bucket, use the CLI as @John Rotenstein mentioned, as below:

aws s3 cp --recursive s3://bucket_name download_path


Answer 7:

I have a workaround for this that runs the AWS CLI in the same process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os
from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:
        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process; main() expects a single list of arguments
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

To execute:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')


Answer 8:

import os
import boto3

# set up the bucket to download from (bucket name as used in the question)
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')

for objs in my_bucket.objects.all():
    print(objs.key)
    path = '/tmp/' + os.sep.join(objs.key.split(os.sep)[:-1])
    try:
        if not os.path.exists(path):
            os.makedirs(path)
        my_bucket.download_file(objs.key, '/tmp/' + objs.key)
    except FileExistsError:
        print(objs.key + ' exists')

This code downloads the content into the /tmp/ directory. You can change the directory if you want.