Spark Write to S3 V4 SignatureDoesNotMatch Error

Posted 2020-08-11 04:14

Question:

I get an S3 SignatureDoesNotMatch error when writing a DataFrame to S3 with Spark.

Symptoms and things I have tried:

  • The code fails intermittently: sometimes it works, sometimes it does not;
  • The code can read from S3 without any problem and can write to S3 from time to time, which rules out wrong config settings such as S3A / enableV4 / wrong key / region endpoint;
  • The S3A endpoint was set according to the S3 endpoint documentation;
  • Made sure the AWS_SECRET_KEY does not contain any non-alphanumeric characters, as suggested here;
  • Made sure the server time is in sync by using NTP;
  • The following was tested on an EC2 m3.xlarge with spark-2.0.2-bin-hadoop2.7 running in local mode;
  • The issue goes away when the files are written to the local filesystem;
  • The current workaround is to mount the bucket with s3fs and write there; however, this is not ideal, as s3fs dies quite often under the load Spark puts on it.

The code can be boiled down to:

spark-submit \
    --verbose \
    --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
    --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    --driver-java-options '-Dcom.amazonaws.services.s3.enableV4' \
    foobar.py


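# foobar.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

# in_file_path and out_file_path are s3a:// paths defined elsewhere
sc = SparkContext.getOrCreate()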
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 's3.dualstack.ap-southeast-2.amazonaws.com')

hc = SparkSession.builder.enableHiveSupport().getOrCreate()
dataframe = hc.read.parquet(in_file_path)

dataframe.write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
    sep=',',
    quote='"',
    escape='\\',
    escapeQuotes='true',
)

Spark then fails with the SignatureDoesNotMatch error.

With log4j set to verbose, the following appears to happen:

  • Each individual partition is written to a staging location on S3, /_temporary/foobar.part-xxx;
  • A PUT call moves each partition into its final location;
  • After a few successful PUT calls, all subsequent PUT calls fail with a 403;
  • As the requests are made by aws-java-sdk, I am not sure what can be done at the application level.

The following log is from another event with the exact same error:

 >> PUT XXX/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet HTTP/1.1
 >> Host: XXX.s3-ap-southeast-2.amazonaws.com
 >> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
 >> X-Amz-Date: 20161104T005749Z
 >> x-amz-metadata-directive: REPLACE
 >> Connection: close
 >> User-Agent: aws-sdk-java/1.10.11 Linux/3.13.0-100-generic OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91 com.amazonaws.services.s3.transfer.TransferManager/1.10.11
 >> x-amz-server-side-encryption-aws-kms-key-id: 5f88a222-715c-4a46-a64c-9323d2d9418c
 >> x-amz-server-side-encryption: aws:kms
 >> x-amz-copy-source: /XXX/_temporary/0/task_201611040057_0001_m_000025/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet
 >> Accept-Ranges: bytes
 >> Authorization: AWS4-HMAC-SHA256 Credential=AKIAJZCSOJPB5VX2B6NA/20161104/ap-southeast-2/s3/aws4_request, SignedHeaders=accept-ranges;connection;content-length;content-type;etag;host;last-modified;user-agent;x-amz-content-sha256;x-amz-copy-source;x-amz-date;x-amz-metadata-directive;x-amz-server-side-encryption;x-amz-server-side-encryption-aws-kms-key-id, Signature=48e5fe2f9e771dc07a9c98c7fd98972a99b53bfad3b653151f2fcba67cff2f8d
 >> ETag: 31436915380783143f00299ca6c09253
 >> Content-Type: application/octet-stream
 >> Content-Length: 0
DEBUG wire:  << "HTTP/1.1 403 Forbidden[\r][\n]"
DEBUG wire:  << "x-amz-request-id: 849F990DDC1F3684[\r][\n]"
DEBUG wire:  << "x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=[\r][\n]"
DEBUG wire:  << "Content-Type: application/xml[\r][\n]"
DEBUG wire:  << "Transfer-Encoding: chunked[\r][\n]"
DEBUG wire:  << "Date: Fri, 04 Nov 2016 00:57:48 GMT[\r][\n]"
DEBUG wire:  << "Server: AmazonS3[\r][\n]"
DEBUG wire:  << "Connection: close[\r][\n]"
DEBUG wire:  << "[\r][\n]"
DEBUG DefaultClientConnection: Receiving response: HTTP/1.1 403 Forbidden
 << HTTP/1.1 403 Forbidden
 << x-amz-request-id: 849F990DDC1F3684
 << x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=
 << Content-Type: application/xml
 << Transfer-Encoding: chunked
 << Date: Fri, 04 Nov 2016 00:57:48 GMT
 << Server: AmazonS3
 << Connection: close
DEBUG requestId: x-amzn-RequestId: not available
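
One detail of the logged request can be checked locally. The sketch below, using only Python's standard library, compares the `x-amz-content-sha256` value from the log with the SHA-256 of a zero-length body. Together with `Content-Length: 0` and the `x-amz-copy-source` header, a match suggests that the failing PUT is a server-side copy (the "rename" out of `_temporary`) rather than a data upload. The variable names are illustrative.

```python
import hashlib

# Value of x-amz-content-sha256 copied from the logged PUT request.
logged_payload_hash = (
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)

# SHA-256 of a zero-length request body.
empty_body_hash = hashlib.sha256(b"").hexdigest()

# True when the logged request carried no payload.
is_empty_body = logged_payload_hash == empty_body_hash
print(is_empty_body)
```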

Answer 1:

I experienced exactly the same problem and found a solution with the help of this article (other resources point in the same direction). After setting these configuration options, writing to S3 succeeded:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false

I am using Spark 2.1.1 with Hadoop 2.7. My final spark-submit command looked like this:

spark-submit \
    --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
    --conf spark.hadoop.fs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
    --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    --conf spark.speculation=false \
    ...

Additionally, I defined these environment variables:

AWS_ACCESS_KEY_ID=****
AWS_SECRET_ACCESS_KEY=****


Answer 2:

I had the same issue and resolved it by upgrading from aws-java-sdk:1.7.4 to aws-java-sdk:1.11.199 and hadoop-aws:2.7.7 to hadoop-aws:3.0.0.

However, to avoid dependency mismatches when interacting with AWS, I had to rebuild Spark and provide it with my own version of Hadoop 3.0.0.

I speculate that the root cause lies in how the v4 signature algorithm uses the current timestamp: all Spark executors then use the same signature to authenticate their PUT requests. If one request slips outside the time window allowed by the algorithm, that request and all further requests fail, causing Spark to roll back the changes and error out. This would explain why calling .coalesce(1) or .repartition(1) always works, while the failure rate climbs in proportion to the number of partitions being written.
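
To make the role of the timestamp concrete, here is a minimal sketch of how Signature Version 4 derives a signing key and signs a request, using only Python's standard library. It does not confirm the speculation above; it only illustrates that the request timestamp is an input to the signature, so two otherwise identical requests signed one second apart get different signatures. The secret key and the canonical request hash are placeholders, and the function names are illustrative.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key; date is YYYYMMDD."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def sign(secret_key, timestamp, region, service, canonical_request_hash):
    """Return the hex signature; timestamp looks like 20161104T005749Z."""
    date = timestamp[:8]
    scope = f"{date}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", timestamp, scope, canonical_request_hash]
    )
    key = signing_key(secret_key, date, region, service)
    return hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder secret and request hash, for illustration only.
request_hash = hashlib.sha256(b"example canonical request").hexdigest()
sig_a = sign("EXAMPLE_SECRET", "20161104T005749Z", "ap-southeast-2", "s3", request_hash)
sig_b = sign("EXAMPLE_SECRET", "20161104T005750Z", "ap-southeast-2", "s3", request_hash)
print(sig_a != sig_b)  # the timestamp alone changes the signature
```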



Answer 3:

  1. What do you mean "s3a" dies? I'm curious about that. If you have stack traces, file them on the Apache JIRA server, project HADOOP, component fs/s3.
  2. s3n doesn't support the v4 API. It's not a matter of endpoint, but of the new signature mechanism. Its jets3t library is not going to be upgraded except for security reasons, so stop trying to work with it.

One problem that Spark is going to have with S3, irrespective of driver, is that it's an eventually consistent object store, where renames take O(bytes) to complete and the delayed consistency between PUT and LIST can break the commit. More succinctly: Spark assumes that after you write something to a filesystem, if you do an ls of the parent directory, you find the thing you just wrote. S3 doesn't offer that, hence the term "eventual consistency". Now, in HADOOP-13786 we are trying to do better, and in HADOOP-13345 we are seeing if we can use Amazon DynamoDB for a faster, consistent view of the world. But you will have to pay the DynamoDB premium for that feature.
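
The commit problem described above can be illustrated with a toy in-memory model. This is a deliberately simplified sketch, not how S3 or the Hadoop committers are implemented: `ToyObjectStore`, `list_lag`, and `commit` are invented for illustration, and "time" is just a counter of operations. An object is readable as soon as it is written but only shows up in listings after a lag, and a rename is modelled as a copy followed by a delete.

```python
class ToyObjectStore:
    """In-memory toy: objects exist at once but are listed only after a lag."""

    def __init__(self, list_lag):
        self.list_lag = list_lag  # operations that must pass before a key is listed
        self.clock = 0            # counts operations, standing in for time
        self.objects = {}         # key -> (data, clock value when written)

    def put(self, key, data):
        self.clock += 1
        self.objects[key] = (data, self.clock)

    def list_keys(self, prefix):
        self.clock += 1
        return sorted(
            key
            for key, (_, written) in self.objects.items()
            if key.startswith(prefix) and self.clock - written > self.list_lag
        )

    def rename(self, src, dst):
        # No native rename: copy the bytes to the new key, then drop the old one.
        data, _ = self.objects.pop(src)
        self.put(dst, data)


def commit(store, tmp_prefix, final_prefix):
    """Move every *listed* temporary object to its final location."""
    moved = []
    for key in store.list_keys(tmp_prefix):
        dst = final_prefix + key[len(tmp_prefix):]
        store.rename(key, dst)
        moved.append(dst)
    return moved


store = ToyObjectStore(list_lag=1)
store.put("_temporary/part-00000", b"a")
store.put("_temporary/part-00001", b"b")
committed = commit(store, "_temporary/", "out/")
print(committed)
```

In this run the second part file was written too recently to appear in the listing, so the commit moves only the first one and silently leaves the other behind in `_temporary/`.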

Finally, everything currently known about s3a troubleshooting, including possible causes of 403 errors, is online. Hopefully it'll help, and if you identify another cause, patches are welcome.