How to extract files in S3 on the fly with boto3?

Posted 2019-06-25 23:09

Question:

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without downloading them locally, extracting them, and then pushing them back to S3.

With boto3 + Lambda, how can I achieve my goal?

I didn't see an extract feature in the boto3 documentation.

Answer 1:

Amazon S3 is a storage service. It has no built-in capability to manipulate the content of stored files.

However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, and then upload the content back again. Note that Lambda's temporary disk space (/tmp) is limited to 512MB, so avoid extracting too much data at once.
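
As a concrete illustration of that download/decompress/upload cycle, here is a minimal sketch using Lambda's /tmp directory; the bucket and key names are placeholders, and it assumes a single .gz object:

import gzip
import shutil
import boto3

s3 = boto3.client('s3')

# placeholder names -- substitute your own bucket and keys
bucket = 'my-bucket'
gzipped_key = 'data.csv.gz'
uncompressed_key = 'data.csv'

# download the compressed object to Lambda's temporary disk (512MB limit)
s3.download_file(bucket, gzipped_key, '/tmp/data.csv.gz')

# decompress it on disk
with gzip.open('/tmp/data.csv.gz', 'rb') as f_in, open('/tmp/data.csv', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# upload the decompressed file back to S3
s3.upload_file('/tmp/data.csv', bucket, uncompressed_key)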

You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:

  • Use boto3 (assuming you like Python) to download the new file
  • Use Python's gzip module to decompress the data (zipfile would be the choice for .zip archives)
  • Use boto3 to upload the resulting file(s)

Sample code

import gzip
from io import BytesIO

import boto3

# placeholder names -- substitute your own bucket and keys
bucket = 'my-bucket'
gzipped_key = 'data.csv.gz'
uncompressed_key = 'data.csv'

s3 = boto3.client('s3', use_ssl=False)

# decompress the object in memory and re-upload the result
s3.upload_fileobj(
    Fileobj=gzip.GzipFile(
        None,
        'rb',
        fileobj=BytesIO(
            s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,
    Key=uncompressed_key)
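
For the trigger side, a minimal handler sketch is below. The event layout is the standard S3 notification format, but the handler itself and the key-naming rule (dropping the .gz suffix) are assumptions for illustration; also make sure the trigger filters on the .gz suffix so the re-uploaded file doesn't fire the function again.

import gzip
from io import BytesIO
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # each record describes one object-created event from the S3 trigger
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # keys arrive URL-encoded in S3 events
        gzipped_key = unquote_plus(record['s3']['object']['key'])

        # assumed naming rule: 'data.csv.gz' is written back as 'data.csv'
        uncompressed_key = gzipped_key[:-3]

        # decompress in memory and upload, as in the sample code above
        body = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
        s3.upload_fileobj(
            Fileobj=gzip.GzipFile(None, 'rb', fileobj=BytesIO(body)),
            Bucket=bucket,
            Key=uncompressed_key)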


Answer 2:

You can read the object from S3 into a BytesIO buffer, decompress it with gzip.GzipFile, and write the result back to S3 with upload_fileobj, all without touching local disk.

# python imports
import gzip
import boto3
from io import BytesIO

# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'

# initialize the s3 client; this assumes your AWS credentials are configured
s3 = boto3.client('s3', use_ssl=False)  # use_ssl=False is optional
s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to

Ensure that your key is reading in correctly:

# read the body of the s3 object into bytes to ensure the download works
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s))  # check to ensure some data was returned
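
Similarly, once the upload above has run, you can confirm the decompressed object landed without downloading it; head_object returns the object's metadata (the key name is the placeholder from above):

# confirm the decompressed object exists and has a non-zero size
meta = s3.head_object(Bucket=bucket, Key=uncompressed_key)
print(meta['ContentLength'])  # decompressed size in bytes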