AWS: access S3 from Spark using an IAM role

Posted 2020-07-30 00:27

Question:

I want to access S3 from Spark without configuring any access and secret keys; I want to use the IAM role attached to the instance instead, so I followed the steps given in s3-spark.

But it is still not working from my EC2 instance (which is running standalone Spark).

It works when I test with the AWS CLI:

[ec2-user@ip-172-31-17-146 bin]$ aws s3 ls s3://testmys3/
2019-01-16 17:32:38        130 e.json

but it does not work when I try the following in spark-shell:

scala> val df = spark.read.json("s3a://testmys3/*")

I get the error below:

19/01/16 18:23:06 WARN FileStreamSink: Error while looking for metadata directory.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: E295957C21AFAC37, AWS Error Code: null, AWS Error Message: Bad Request
  at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
  at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
  at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
  at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)

Answer 1:

"400 Bad Request" is fairly unhelpful, and not only does S3 not provide much, the S3A connector doesn't date print much related to auth either. There's a big section on troubleshooting the error

The fact that it got as far as making a request means it has some credentials; the far end just doesn't like them.

Possibilities:

  • Your IAM role doesn't have permission for s3:ListBucket. See the IAM role permissions for working with s3a.
  • Your bucket name is wrong.
  • There are settings in fs.s3a or the AWS_ environment variables which take priority over the IAM role, and they are wrong.

You should get IAM auth automatically with the S3A connector; it is the mechanism checked last, after the config options and the environment variables. You can check for stray overrides from the running shell, as sketched below.
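
For example, from the running spark-shell you can print anything that would take priority over the instance profile. This is just a quick check; the property and variable names below are the standard S3A/AWS ones, nothing specific to this setup:

    val hc = spark.sparkContext.hadoopConfiguration
    // fs.s3a credentials set in the Hadoop/Spark config take priority over the IAM role
    Seq("fs.s3a.access.key", "fs.s3a.secret.key").foreach(k => println(s"$k = ${hc.get(k)}"))
    // so do the AWS_ environment variables picked up by the SDK's default chain
    Seq("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")
      .foreach(k => println(s"$k = ${sys.env.getOrElse(k, "<unset>")}"))

If any of those come back non-empty with values you don't recognise, that is where the bad credentials are coming from.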

  1. Have a look at what is set in fs.s3a.aws.credentials.provider: it must be either unset or contain com.amazonaws.auth.InstanceProfileCredentialsProvider. You can also check this from the running shell; see the sketch after this list.
  2. Assuming you also have hadoop on the command line, grab storediag:

    hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag s3a://testmys3/

It should dump what it is doing with regard to authentication.
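
If you don't have the hadoop command handy, a rough equivalent of step 1 from inside spark-shell (a sketch; it only checks the provider option):

    // must be unset (null) or include com.amazonaws.auth.InstanceProfileCredentialsProvider
    println(spark.sparkContext.hadoopConfiguration.get("fs.s3a.aws.credentials.provider"))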

Update

As the original poster commented, the problem was that V4 request signing was required by that specific S3 endpoint. This can be enabled on the 2.7.x version of the S3A client, but only via Java system properties. For 2.8+ there are fs.s3a. options you can set instead.
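
For reference, a sketch of the 2.8+ style (not verified against your exact Spark/Hadoop build): point fs.s3a.endpoint at the regional endpoint so the connector signs requests for that region, without needing the JVM system properties:

    ./spark-shell \
        --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com \
        --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider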



Answer 2:

This config worked:

    ./spark-shell \
        --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
        --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com \
        --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
        --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
        --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
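
Once the shell is up like that, a quick sanity check (assuming the e.json object from the question is still in the bucket):

    scala> val df = spark.read.json("s3a://testmys3/e.json")
    scala> df.show()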