How can I make Apache Spark use multipart uploads when saving data to Amazon S3? Spark writes data using the RDD.saveAs...File methods. When the destination starts with s3n://, Spark automatically uses JetS3t to do the upload, but this fails for files larger than 5 GB. Large files need to be uploaded to S3 using multipart upload, which is supposed to be beneficial for smaller files as well. Multipart uploads are supported in JetS3t via MultipartUtils, but Spark does not use this in its default configuration. Is there a way to make it use this functionality?
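For reference, a minimal sketch of the write path in question, assuming a running Spark context; the bucket name and the 128 MB sizes are placeholders, and the fs.s3a.* property names come from Hadoop's s3a documentation rather than from Spark itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3a-multipart"))

// Tune multipart upload on the s3a connector (values in bytes):
// part size, and the object size above which multipart is used.
sc.hadoopConfiguration.set("fs.s3a.multipart.size", "134217728")
sc.hadoopConfiguration.set("fs.s3a.multipart.threshold", "134217728")

// Writing to an s3a:// URI instead of s3n:// routes the upload
// through the s3a connector, which supports multipart upload.
sc.parallelize(1 to 1000000)
  .map(_.toString)
  .saveAsTextFile("s3a://my-bucket/output") // hypothetical bucket
```

This requires the hadoop-aws module (and its AWS SDK dependency) on the classpath.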
s3n seems to be on a deprecation path.

From their documentation:

This is a limitation of s3n; you can use the newer s3a protocol to access your files in S3. s3a is based on the aws-sdk library and supports most of its features, including multipart upload. More details in this link:
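A minimal sketch of the relevant configuration, assuming a Hadoop build that ships the s3a connector (hadoop-aws on the classpath); the property names follow Hadoop's s3a documentation, and the credentials and sizes are illustrative placeholders:

```
# spark-defaults.conf — route S3 access through the s3a connector
spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key          YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key          YOUR_SECRET_KEY

# Multipart tuning (bytes): part size, and the object size
# above which multipart upload is used — 128 MB here.
spark.hadoop.fs.s3a.multipart.size      134217728
spark.hadoop.fs.s3a.multipart.threshold 134217728
```

The spark.hadoop. prefix tells Spark to copy these properties into the Hadoop configuration it hands to the S3 filesystem client, so jobs can then write to s3a:// paths without code changes beyond the URI scheme.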