I encounter an S3 SignatureDoesNotMatch error while trying to write a DataFrame to S3 with Spark.
The symptoms and the things I have tried:
- The code fails sometimes and works sometimes;
- The code can read from S3 without any problem and is able to write to S3 from time to time, which rules out wrong config settings such as S3A / enableV4 / wrong key / region endpoint (a standalone sanity check outside Spark is sketched after this list);
- The S3A endpoint has been set according to the S3 docs (S3 Endpoint);
- Made sure the AWS_SECRET_ACCESS_KEY does not contain any non-alphanumeric characters, as suggested here;
- Made sure the server time is in sync using NTP;
- The following was tested on an EC2 m3.xlarge with spark-2.0.2-bin-hadoop2.7 running in local mode;
- The issue is gone when the files are written to the local filesystem;
- Right now the workaround is to mount the bucket with s3fs and write there; however, this is not ideal, as s3fs dies quite often under the stress Spark puts on it.
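To rule out the credentials, region, and endpoint independently of Spark, a standalone PUT can be issued with boto3. This is only a sketch; boto3 is not used in the job itself, and the bucket and key names below are placeholders:

import boto3

# Placeholders: substitute the bucket and region used by the Spark job.
# Credentials are picked up from the environment or ~/.aws/credentials.
s3 = boto3.client("s3", region_name="ap-southeast-2")
s3.put_object(Bucket="my-bucket", Key="sanity-check/hello.txt", Body=b"hello")
print(s3.get_object(Bucket="my-bucket", Key="sanity-check/hello.txt")["Body"].read())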
The code can be boiled down to:
spark-submit \
  --verbose \
  --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --driver-java-options '-Dcom.amazonaws.services.s3.enableV4' \
  foobar.py
# foobar.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()

# S3A credentials and the region-specific (dual-stack) endpoint; values redacted.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 's3.dualstack.ap-southeast-2.amazonaws.com')

hc = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the Parquet input and write it back out as gzipped CSV.
dataframe = hc.read.parquet(in_file_path)
dataframe.write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
    sep=',',
    quote='"',
    escape='\\',
    escapeQuotes='true',
)
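As an aside, the spark-submit command above only defines -Dcom.amazonaws.services.s3.enableV4 for the driver JVM. In local mode that is enough, since tasks run inside the driver, but on a cluster the flag would also have to reach the executor JVMs. Below is a minimal sketch of the same settings collected on the builder, with that executor-side option added as my own assumption rather than something from the original job:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.dualstack.ap-southeast-2.amazonaws.com")
    # Assumption, not in the original command: enable V4 signing in executor JVMs too.
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4")
    .enableHiveSupport()
    .getOrCreate()
)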
Spark throws the following error.
With log4j set to verbose, it appears the following had happened:
- Each individual partition is written to a staging location on S3, i.e. /_temporary/foobar.part-xxx;
- A PUT call then moves the partitions into their final location;
- After a few successful PUT calls, all subsequent PUT calls failed due to 403;
- As the requests were made by aws-java-sdk, I am not sure what to do at the application level;
- The following log is from another event with exactly the same error:
>> PUT XXX/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet HTTP/1.1
>> Host: XXX.s3-ap-southeast-2.amazonaws.com
>> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
>> X-Amz-Date: 20161104T005749Z
>> x-amz-metadata-directive: REPLACE
>> Connection: close
>> User-Agent: aws-sdk-java/1.10.11 Linux/3.13.0-100-generic OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91 com.amazonaws.services.s3.transfer.TransferManager/1.10.11
>> x-amz-server-side-encryption-aws-kms-key-id: 5f88a222-715c-4a46-a64c-9323d2d9418c
>> x-amz-server-side-encryption: aws:kms
>> x-amz-copy-source: /XXX/_temporary/0/task_201611040057_0001_m_000025/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet
>> Accept-Ranges: bytes
>> Authorization: AWS4-HMAC-SHA256 Credential=AKIAJZCSOJPB5VX2B6NA/20161104/ap-southeast-2/s3/aws4_request, SignedHeaders=accept-ranges;connection;content-length;content-type;etag;host;last-modified;user-agent;x-amz-content-sha256;x-amz-copy-source;x-amz-date;x-amz-metadata-directive;x-amz-server-side-encryption;x-amz-server-side-encryption-aws-kms-key-id, Signature=48e5fe2f9e771dc07a9c98c7fd98972a99b53bfad3b653151f2fcba67cff2f8d
>> ETag: 31436915380783143f00299ca6c09253
>> Content-Type: application/octet-stream
>> Content-Length: 0
DEBUG wire: << "HTTP/1.1 403 Forbidden[\r][\n]"
DEBUG wire: << "x-amz-request-id: 849F990DDC1F3684[\r][\n]"
DEBUG wire: << "x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=[\r][\n]"
DEBUG wire: << "Content-Type: application/xml[\r][\n]"
DEBUG wire: << "Transfer-Encoding: chunked[\r][\n]"
DEBUG wire: << "Date: Fri, 04 Nov 2016 00:57:48 GMT[\r][\n]"
DEBUG wire: << "Server: AmazonS3[\r][\n]"
DEBUG wire: << "Connection: close[\r][\n]"
DEBUG wire: << "[\r][\n]"
DEBUG DefaultClientConnection: Receiving response: HTTP/1.1 403 Forbidden
<< HTTP/1.1 403 Forbidden
<< x-amz-request-id: 849F990DDC1F3684
<< x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=
<< Content-Type: application/xml
<< Transfer-Encoding: chunked
<< Date: Fri, 04 Nov 2016 00:57:48 GMT
<< Server: AmazonS3
<< Connection: close
DEBUG requestId: x-amzn-RequestId: not available
One problem that Spark is going to have with S3, irrespective of driver, is that it is an eventually consistent object store, where renames take O(bytes) to complete and the delayed consistency between PUT and LIST can break the commit. More succinctly: Spark assumes that after you write something to a filesystem, an ls of the parent directory finds the thing you just wrote. S3 doesn't offer that, hence the term "eventual consistency".
Now, in HADOOP-13786 we are trying to do better, and in HADOOP-13345 we are looking at whether we can use Amazon DynamoDB for a faster, consistent view of the world. But you will have to pay the DynamoDB premium for that feature.
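As a rough illustration of the list-after-write assumption described above (this is only a sketch with hypothetical bucket and key names, not Spark's committer code):

import boto3

# Hypothetical bucket/prefix, purely illustrative; credentials come from the environment.
s3 = boto3.client("s3", region_name="ap-southeast-2")
s3.put_object(Bucket="my-bucket", Key="output/_temporary/part-00000", Body=b"data")

# On an eventually consistent store, a listing issued immediately after the PUT
# may not yet contain the new key, and the rename-based commit relies on exactly
# that listing to find the files it has to move.
listing = s3.list_objects_v2(Bucket="my-bucket", Prefix="output/_temporary/")
keys = [obj["Key"] for obj in listing.get("Contents", [])]
print("output/_temporary/part-00000" in keys)  # may be False right after the PUT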
Finally, everything currently known about S3A troubleshooting, including possible causes of 403 errors, is online. Hopefully it will help, and if you identify another cause, patches are welcome.
I had the same issue and resolved it by upgrading from aws-java-sdk:1.7.4 to aws-java-sdk:1.11.199 and from hadoop-aws:2.7.7 to hadoop-aws:3.0.0. However, to avoid dependency mismatches when interacting with AWS, I had to rebuild Spark and provide it with my own version of Hadoop 3.0.0.
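If rebuilding Spark is not an option, one way to experiment with the newer artifacts is to pull them in at submit time. This is a sketch under the assumption that the Hadoop classes bundled with your Spark build do not conflict (those conflicts are exactly the dependency mismatches mentioned above); the version comes from this answer:

from pyspark.sql import SparkSession

# hadoop-aws 3.0.0 brings in its matching aws-java-sdk-bundle (1.11.199) transitively.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.0.0")
    .getOrCreate()
)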
I speculate that the root cause is the way the v4 signature algorithm takes the current timestamp as an input: all Spark executors then use the same signature to authenticate their PUT requests, but if one slips outside the "window" of time allowed by the algorithm, that request, and all further requests, fail, causing Spark to roll back the changes and error out. This explains why calling .coalesce(1) or .repartition(1) always works, while the failure rate climbs in proportion to the number of partitions being written (a sketch of that workaround follows).
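Based on that observation, a minimal sketch of the single-partition workaround, reusing the dataframe and out_file_path from the question:

# Writing everything as a single partition means one upload per output file, so
# (per the speculation above) there is no later request left to fall outside the
# signature's validity window.
dataframe.coalesce(1).write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
)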
I experienced exactly the same problem and found a solution with the help of this article (other resources point in the same direction). After setting these configuration options, writing to S3 succeeded.
I am using Spark 2.1.1 with Hadoop 2.7. My final spark-submit command looked like this:
Additionally, I defined these environment variables:
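The exact command, configuration options, and variables are not reproduced here. Purely as an illustration of the kind of settings usually involved, and not this author's verified values, a sketch might look like this:

import os
from pyspark.sql import SparkSession

# Illustrative placeholders only: standard AWS credential variables.
os.environ["AWS_ACCESS_KEY_ID"] = "xxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxx"

spark = (
    SparkSession.builder
    # Assumed, not copied from the answer: S3A with a region-specific, V4-signed endpoint.
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.ap-southeast-2.amazonaws.com")
    .getOrCreate()
)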