Optimize S3 download for a large number of tiny files

Published 2019-07-21 15:30

Question:

I currently use TransferManager to download all files in an S3 bucket, from a Lambda function.

// Initialize
TransferManagerBuilder txBuilder = TransferManagerBuilder.standard();
// txBuilder.setExecutorFactory(() -> Executors.newFixedThreadPool(50));
TransferManager tx = txBuilder.build();
final Path tmpDir = Files.createTempDirectory("s3_download"); // unique dir under the default temp dir (/tmp on Lambda)

// Download
MultipleFileDownload download = tx.downloadDirectory(bucketName,
                                                     bucketKey,
                                                     tmpDir.toFile());
download.waitForCompletion();

return Files.list(tmpDir.resolve(bucketKey)).collect(Collectors.toList());

It takes around 300 seconds to download 10,000 files (~20 KB each), which works out to a transfer rate of about 666 KB/s. Increasing the thread pool size doesn't seem to affect the transfer rate at all.

The S3 endpoint and the Lambda function are in the same AWS region and the same AWS account.

How can I optimize the S3 downloads?

Answer 1:

Dealing with a large amount of data always requires architecting your storage with the underlying systems in mind.

If you need high throughput, you need to partition your S3 keys so that the bucket can accommodate a high number of requests. Distributed systems come with their own requirements when you need high performance, and this is one of them.
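
For illustration, here is a minimal sketch of what hash-based key partitioning could look like on the write side. The partitionedKey helper and the key layout are my own example, not something from the question; the idea is simply to avoid funnelling all 10,000 objects under one hot prefix.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyPartitioning {

    // Prepend a short hash to the object name so keys are spread across
    // many prefixes instead of all sharing a single hot one.
    static String partitionedKey(String basePrefix, String objectName)
            throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(objectName.getBytes(StandardCharsets.UTF_8));
        String hashPrefix = String.format("%02x", digest[0]);
        return basePrefix + "/" + hashPrefix + "/" + objectName;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // e.g. prints something like "data/4f/file-00042.json"
        System.out.println(partitionedKey("data", "file-00042.json"));
    }
}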

Request Rate Considerations:

https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Transfer Acceleration:

https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
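
If you want to try Transfer Acceleration, it has to be switched on for the bucket and opted into by the client. Below is a rough sketch with the Java SDK v1 (the bucket name is a placeholder); note that acceleration mainly helps over long distances, so for a Lambda in the same region the gain may be small.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketAccelerateConfiguration;
import com.amazonaws.services.s3.model.BucketAccelerateStatus;
import com.amazonaws.services.s3.model.SetBucketAccelerateConfigurationRequest;

public class EnableAcceleration {
    public static void main(String[] args) {
        String bucketName = "my-bucket"; // placeholder

        // One-time setup: enable Transfer Acceleration on the bucket.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.setBucketAccelerateConfiguration(
                new SetBucketAccelerateConfigurationRequest(bucketName,
                        new BucketAccelerateConfiguration(BucketAccelerateStatus.Enabled)));

        // Clients that should use the accelerate endpoint must opt in;
        // this client can then be passed to TransferManagerBuilder.withS3Client(...).
        AmazonS3 acceleratedClient = AmazonS3ClientBuilder.standard()
                .withAccelerateModeEnabled(true)
                .build();
    }
}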

How to improve throughput:

https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
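
On the client side, one thing worth checking (my sketch below, not something from the linked article) is that TransferManager is limited by the S3 client's connection pool, which defaults to 50 connections, so enlarging only the executor's thread pool may not help. A sketch that raises both together:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import java.util.concurrent.Executors;

public class TunedTransferManager {
    public static TransferManager build() {
        // Raise the HTTP connection pool above the default of 50 so the
        // executor threads are not all queuing for the same few connections.
        ClientConfiguration clientConfig = new ClientConfiguration()
                .withMaxConnections(200);

        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withClientConfiguration(clientConfig)
                .build();

        // Size the executor to match the connection pool.
        return TransferManagerBuilder.standard()
                .withS3Client(s3)
                .withExecutorFactory(() -> Executors.newFixedThreadPool(200))
                .build();
    }
}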

Hope it helps.

EDIT1

I see that you are trying to download the files to the Lambda function's ephemeral storage (/tmp). Be aware of its storage limit; it is not meant for bulk processing.

https://docs.aws.amazon.com/lambda/latest/dg/limits.html
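
If the files only need to be processed rather than kept on disk, one alternative (my suggestion, not part of the answer above) is to stream each object into memory instead of writing everything under /tmp. A rough sketch; the processing step is a placeholder:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.util.IOUtils;
import java.io.IOException;

public class StreamObjects {
    public static void process(String bucketName, String bucketKey) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ListObjectsV2Request request = new ListObjectsV2Request()
                .withBucketName(bucketName)
                .withPrefix(bucketKey);
        ListObjectsV2Result result;
        do {
            result = s3.listObjectsV2(request);
            for (S3ObjectSummary summary : result.getObjectSummaries()) {
                // Stream the object into memory instead of writing it to /tmp.
                try (S3Object object = s3.getObject(bucketName, summary.getKey())) {
                    byte[] content = IOUtils.toByteArray(object.getObjectContent());
                    // ... process `content` here (placeholder) ...
                }
            }
            request.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
    }
}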