S3A: fails while S3: works in Spark EMR

Posted 2020-07-16 03:03

Question:

I'm using EMR 5.5.0 with Spark. If I write a simple file to S3 using an s3://... URL, it writes fine. But if I use an s3a://... address, it fails with Service: Amazon S3; Status Code: 403; Error Code: AccessDenied.

Using the AWS command line I'm able to cp, mv, and rm any file in the path I'm writing to. But from Spark, s3a fails on the PUT request.

We have server-side encryption enabled, and I know Spark picks it up because the s3:// URLs work. Any ideas?

Failed PUT DEBUG logs here. Maybe it's important to note: I'm doing an rdd.saveAsTextFile(path), but the PUT request shows it trying to write to /my-bucket/tmp/carlos/testWrite/4/_temporary/0/, which I thought it should only do for Parquet. Not sure if that detail is relevant, but I thought I would mention it.

Answer 1:

s3a is the actively maintained S3 client in Apache Hadoop. AWS forked their own client off from the Apache s3n:// client many years ago and (presumably) have massively reworked theirs.

Both clients can read and write the same data, but some parts of EMR expect extra methods in the filesystem client which only the EMR s3 client supports, so you cannot safely use s3a on EMR.

There's also the original ASF s3:// client, which is incompatible with everything else but was the first code used to connect Hadoop with S3, well before EMR was a product from Amazon.

Which is better? As of Aug 2017, S3A is probably faster on aggressive read IO of columnar formats like ORC and Parquet. EMR S3, with EMRFS, probably has the edge in terms of resilience and consistency, but the open source ASF S3A client is moving to address those.



Answer 2:

Turns out EMR does not support the s3a protocol at all as of today. In addition, the AWS documentation says s3 and s3n are interchangeable, but you should use s3.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/ http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
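If job code or config already carries s3a:// or s3n:// paths, one way to follow the advice above is to rewrite the scheme before handing the path to Spark on EMR. This is only a minimal sketch: the helper name to_emrfs_uri is made up for illustration and is not part of any AWS or Spark API, and the bucket path is the one from the question.

```python
from urllib.parse import urlsplit, urlunsplit

# Schemes that should be pointed at the EMR s3:// client instead.
# (Assumption for this sketch: the job runs on EMR, where s3:// is EMRFS.)
_REWRITE_SCHEMES = ("s3a", "s3n")


def to_emrfs_uri(uri):
    """Return uri with an s3a:// or s3n:// scheme replaced by s3://.

    Hypothetical helper; any other URI or plain path is returned unchanged.
    """
    parts = urlsplit(uri)
    if parts.scheme not in _REWRITE_SCHEMES:
        return uri
    return urlunsplit(parts._replace(scheme="s3"))


if __name__ == "__main__":
    print(to_emrfs_uri("s3a://my-bucket/tmp/carlos/testWrite/4"))
```

The result would then be passed to something like rdd.saveAsTextFile(...) in place of the original s3a:// path.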

One thing to note is that s3a, although not supported, seems to work for reading but not for writing.
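One possible explanation for reads working while writes fail, which is an assumption and not something confirmed in this thread, is that the bucket rejects PUT requests that do not ask for server-side encryption, and the S3A client only asks for it when configured to. Where s3a is in use (for example outside EMR), the Hadoop property fs.s3a.server-side-encryption-algorithm can be set through Spark's spark.hadoop.* prefix. The sketch below just builds those settings; the helper names are invented for illustration, and AES256 (SSE-S3) is an assumed choice that may not match the bucket's actual policy.

```python
def s3a_sse_conf(algorithm="AES256"):
    """Spark conf entries asking S3A to request server-side encryption.

    Hypothetical helper. "AES256" (SSE-S3) is an assumption; the bucket
    in the question may require a different algorithm.
    """
    return {
        # spark.hadoop.* entries are copied into the Hadoop configuration.
        "spark.hadoop.fs.s3a.server-side-encryption-algorithm": algorithm,
    }


def to_submit_args(conf):
    """Flatten a conf dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", "{}={}".format(key, value)]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(s3a_sse_conf())))
```

Even with this set, the earlier answer's caveat still applies: on EMR itself the s3:// client is the supported path.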

Update May 29, 2018:

Just to give a fuller answer, the s3a protocol is supported with s3+emr if you're using them with Databricks.

https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html