How can I use boto to stream a file out of Amazon

Posted 2019-01-07 09:53

Question:

I'm copying a file from S3 to Cloudfiles, and I would like to avoid writing the file to disk. The Python-Cloudfiles library has an object.stream() call that looks to be what I need, but I can't find an equivalent call in boto. I'm hoping that I would be able to do something like:

shutil.copyfileobj(s3Object.stream(),rsObject.stream())

Is this possible with boto (or I suppose any other s3 library)?

Answer 1:

The Key object in boto, which represents an object in S3, can be used like an iterator, so you should be able to do something like this:

>>> import boto
>>> c = boto.connect_s3()
>>> bucket = c.lookup('garnaat_pub')
>>> key = bucket.lookup('Scan1.jpg')
>>> for chunk in key:
...     output.write(chunk)  # output: any writable binary file-like object

Or, as in the case of your example, you could do:

>>> shutil.copyfileobj(key, rsObject.stream())
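A minimal sketch of that copy pattern, with in-memory `io.BytesIO` objects standing in for the S3 key and the Cloud Files stream. Both stand-ins and the `stream_copy` name are assumptions for illustration; any source with `read(n)` and any destination with `write(data)` should behave the same way. `shutil.copyfileobj` copies in fixed-size chunks, so the whole object is never held in memory at once:

```python
import io
import shutil

def stream_copy(source, destination, chunk_size=64 * 1024):
    """Copy source to destination chunk by chunk, without touching disk."""
    # copyfileobj calls source.read(chunk_size) until it returns an empty value
    shutil.copyfileobj(source, destination, chunk_size)

# Stand-ins for the real S3 key and Cloud Files stream (hypothetical data)
source = io.BytesIO(b"x" * 200000)
destination = io.BytesIO()
stream_copy(source, destination)
copied = destination.getvalue()
```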


Answer 2:

Other answers in this thread are related to boto, but S3.Object is no longer iterable in boto3. So the following DOES NOT WORK; it produces a TypeError: 's3.Object' object is not iterable error message:

    s3 = boto3.session.Session(profile_name=my_profile).resource('s3')
    s3_obj = s3.Object(bucket_name=my_bucket, key=my_key)

    with io.FileIO('sample.txt', 'w') as file:
        for i in s3_obj:
            file.write(i)

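In boto3, the contents of the object are available at S3.Object.get()['Body'], which was not iterable either when this answer was written (newer botocore releases add iteration helpers such as iter_lines(), as the last answer in this thread shows), so the following still DOES NOT WORK: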

    body = s3_obj.get()['Body']
    with io.FileIO('sample.txt', 'w') as file:
        for i in body:
            file.write(i)

So, an alternative is to use the read method, but this loads the WHOLE S3 object into memory, which is not always possible when dealing with large files:

    body = s3_obj.get()['Body']
    with io.FileIO('sample.txt', 'w') as file:
        file.write(body.read())

But the read method also accepts an amt parameter specifying the number of bytes to read from the underlying stream. It can be called repeatedly until the whole stream has been read:

    body = s3_obj.get()['Body']
    with io.FileIO('sample.txt', 'w') as file:
        while file.write(body.read(amt=512)):
            pass
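The same chunked loop can be written as a generator, which makes the end-of-stream condition explicit. This is only a sketch: `iter_chunks` is a hypothetical helper name, and `io.BytesIO` stands in for the streaming body so the snippet runs without AWS credentials. It assumes only that the body has a `read(n)` method returning empty bytes at the end of the stream:

```python
import io
from functools import partial

def iter_chunks(body, chunk_size=512):
    """Yield successive chunks from any object with a read(n) method."""
    # iter(callable, sentinel) keeps calling read(chunk_size) until it returns b""
    for chunk in iter(partial(body.read, chunk_size), b""):
        yield chunk

# io.BytesIO stands in for s3_obj.get()['Body'] (an assumption for illustration)
body = io.BytesIO(b"a" * 1300)
sizes = [len(chunk) for chunk in iter_chunks(body, 512)]
```

With a real object, each yielded chunk could be passed straight to `file.write(chunk)`.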

Digging into the botocore.response.StreamingBody code, one realizes that the underlying stream is also available, so we could iterate as follows (note that _raw_stream is a private attribute, so it may change between releases):

    body = s3_obj.get()['Body']
    with io.FileIO('sample.txt', 'w') as file:
        for b in body._raw_stream:
            file.write(b)

While googling I've also seen some links that could be useful, but I haven't tried them:

  • WrappedStreamingBody
  • Another related thread
  • An issue on the boto3 GitHub requesting that StreamingBody be a proper stream, which has been closed!


Answer 3:

I figure at least some of the people seeing this question will be like me, and will want a way to stream a file from boto line by line (or comma by comma, or any other delimiter). Here's a simple way to do that:

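from boto.s3.connection import S3Connection

def getS3ResultsAsIterator(self, aws_access_info, key, prefix):
    # aws_access_info: dict of keyword arguments for S3Connection
    s3_conn = S3Connection(**aws_access_info)
    # note: `key` is used here as the bucket name
    bucket_obj = s3_conn.get_bucket(key)
    # go through the list of files under the prefix
    for f in bucket_obj.list(prefix=prefix):
        unfinished_line = ''
        for chunk in f:
            chunk = unfinished_line + chunk
            # split on whatever, or use a regex with re.split()
            lines = chunk.split('\n')
            unfinished_line = lines.pop()
            for line in lines:
                yield line
        # emit the last line if the file does not end with a newline
        if unfinished_line:
            yield unfinished_line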

@garnaat's answer above is still great and 100% true. Hopefully mine still helps someone out.
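The splitting logic in this answer does not depend on S3, so it can be tried on its own. A minimal sketch, assuming the chunks are plain strings; `split_stream` is a hypothetical name and the sample chunks are made up so that the boundaries fall in the middle of lines:

```python
def split_stream(chunks, delimiter="\n"):
    """Yield delimiter-separated pieces from an iterable of string chunks."""
    unfinished = ""
    for chunk in chunks:
        pieces = (unfinished + chunk).split(delimiter)
        # the last piece may be incomplete; hold it back for the next chunk
        unfinished = pieces.pop()
        for piece in pieces:
            yield piece
    # whatever remains after the last chunk is the final piece
    if unfinished:
        yield unfinished

# Chunk boundaries deliberately fall in the middle of lines
chunks = ["alpha\nbe", "ta\ngam", "ma"]
lines = list(split_stream(chunks))
```

Passing `delimiter=","` gives the "comma by comma" variant mentioned above.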



Answer 4:

This is my solution for wrapping the streaming body:

import io
import boto3
class S3ObjectInterator(io.RawIOBase):
    def __init__(self, bucket, key):
        """Initialize with S3 bucket and key names"""
        self.s3c = boto3.client('s3')
        self.obj_stream = self.s3c.get_object(Bucket=bucket, Key=key)['Body']

    def read(self, n=-1):
        """Read from the stream"""
        return self.obj_stream.read() if n == -1 else self.obj_stream.read(n)

Example usage:

obj_stream = S3ObjectInterator(bucket, key)
for line in obj_stream:
    print(line)
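A related sketch, not the author's code: implementing `readable()` and `readinto()` instead of `read()` lets the wrapper be handed to `io.BufferedReader`, which then does the line splitting with a buffer rather than one small read at a time. `ReadableStream` is a hypothetical name, and `io.BytesIO` stands in for the body returned by `get_object`; the only assumption about the source is that it has a `read(n)` method:

```python
import io

class ReadableStream(io.RawIOBase):
    """Adapt any object with read(n) to the io.RawIOBase interface."""

    def __init__(self, source):
        self.source = source

    def readable(self):
        return True

    def readinto(self, buffer):
        # Fill the caller's buffer with at most len(buffer) bytes
        data = self.source.read(len(buffer))
        buffer[:len(data)] = data
        return len(data)

# io.BytesIO stands in for get_object(...)['Body'] (an assumption)
source = io.BytesIO(b"first\nsecond\nthird")
reader = io.BufferedReader(ReadableStream(source))
lines = [line for line in reader]
```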


Answer 5:

Botocore's StreamingBody has an iter_lines() method:

https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody.iter_lines

So:

import boto3
s3r = boto3.resource('s3')
iterator = s3r.Object(bucket, key).get()['Body'].iter_lines()

for line in iterator:
    print(line)
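As far as I know, `iter_lines()` yields `bytes` under Python 3, so text processing needs a decode step. A small sketch, assuming UTF-8 content; `decoded_lines` is a hypothetical helper and the sample data is made up to stand in for the real iterator:

```python
def decoded_lines(byte_lines, encoding="utf-8"):
    """Decode an iterable of bytes lines into str lines."""
    for raw_line in byte_lines:
        yield raw_line.decode(encoding)

# Stand-in for the iterator returned by iter_lines() (hypothetical data)
byte_lines = [b"id,name", b"1,caf\xc3\xa9"]
text_lines = list(decoded_lines(byte_lines))
```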