How can I make Apache Spark use multipart uploads when saving data to Amazon S3? Spark writes data using the RDD.saveAs...File methods. When the destination starts with s3n://, Spark automatically uses JetS3t to do the upload, but this fails for files larger than 5 GB. Large files need to be uploaded to S3 using multipart upload, which is supposed to be beneficial for smaller files as well. Multipart uploads are supported in JetS3t via MultipartUtils, but Spark does not use this in its default configuration. Is there a way to make it use this functionality?
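For reference, a minimal sketch of the write path in question, assuming a running Spark context; the bucket name and the 128 MB sizes are placeholders, and the fs.s3a.* property names come from Hadoop's s3a documentation rather than from Spark itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3a-multipart"))

// Tune multipart upload on the s3a connector (values in bytes):
// part size, and the object size above which multipart is used.
sc.hadoopConfiguration.set("fs.s3a.multipart.size", "134217728")
sc.hadoopConfiguration.set("fs.s3a.multipart.threshold", "134217728")

// Writing to an s3a:// URI instead of s3n:// routes the upload
// through the s3a connector, which supports multipart upload.
sc.parallelize(1 to 1000000)
  .map(_.toString)
  .saveAsTextFile("s3a://my-bucket/output") // hypothetical bucket
```

This requires the hadoop-aws module (and its AWS SDK dependency) on the classpath.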
s3n seems to be on a deprecation path.

From their documentation:

This is a limitation of s3n; you can use the newer s3a protocol to access your files in S3. s3a is based on the aws-sdk library and supports most of its features, including multipart upload. More details in this link:
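A minimal sketch of the relevant configuration, assuming a Hadoop build that ships the s3a connector (hadoop-aws on the classpath); the property names follow Hadoop's s3a documentation, and the credentials and sizes are illustrative placeholders:

```
# spark-defaults.conf — route S3 access through the s3a connector
spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key          YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key          YOUR_SECRET_KEY

# Multipart tuning (bytes): part size, and the object size
# above which multipart upload is used — 128 MB here.
spark.hadoop.fs.s3a.multipart.size      134217728
spark.hadoop.fs.s3a.multipart.threshold 134217728
```

The spark.hadoop. prefix tells Spark to copy these properties into the Hadoop configuration it hands to the S3 filesystem client, so jobs can then write to s3a:// paths without code changes beyond the URI scheme.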