Extract .gz files in S3 automatically

2019-08-03 22:40发布

问题:

I'm trying to find a solution to extract ALB logs file in .gz format when they're uploaded automatically from ALB to S3.

My bucket structure is like this

/log-bucket
..alb-1/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-2/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-3/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz

Basically, every 5 minutes, each ALB would automatically push logs to correspond S3 bucket. I'd like to extract new .gz files right at that time in same bucket.

Is there any ways to handle this?

I noticed that we can use Lambda function but not sure where to start. A sample code would be greatly appreciated!

回答1:

Your best choice would probably be to have an AWS Lambda function subscribed to S3 events. Whenever a new object gets created, this Lambda function would be triggered. The Lambda function could then read the file from S3, extract it, write the extracted data back to S3 and delete the original one.

How that works is described in Using AWS Lambda with Amazon S3.

That said, you might also want to reconsider if you really need to store uncompressed logs in S3. Compressed files are not only cheaper, as they don't take as much storage space as uncompressed ones, but they are usually also faster to process, as the bottleneck in most cases is network bandwidth for transferring the data and not available CPU-resources for decompression. Most tools also support working directly with compressed files. Take Amazon Athena (Compression Formats) or Amazon EMR (How to Process Compressed Files) for example.