Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including:
- deploy with hadoop-aws and aws-java-sdk => cannot read environment variable for credentials
- add hadoop-aws into maven => various transitive dependency conflicts
Has anyone successfully made both work?
I'm writing this answer to access files with S3A from Spark 2.0.1 on Hadoop 2.7.3.

1. Copy the AWS jars (hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar) that ship with Hadoop by default.
   Hint: if you are unsure of the jar locations, running the find command as a privileged user can be helpful; see the sketch after this list.
2. Put them into the Spark classpath directory, which holds all the Spark jars.
   Hint: I can't point to that location directly, since it varies between distributions and Linux flavors and has to go into the property file anyway; the directory can also be identified with the find command shown below.
3. Set the classpath entries in spark-defaults.conf.
   Hint: mostly it will be placed in /etc/spark/conf/spark-defaults.conf.
4. In spark-submit, include the jars (aws-java-sdk and hadoop-aws) in --driver-class-path if needed.
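The exact commands and entries are distribution-specific, so here is only a minimal sketch of the four steps above. Every path shown (for example /usr/lib/spark/jars) is a placeholder for whatever find reports on your system, and the application jar in the spark-submit line is hypothetical.

```
# Steps 1 and 2: locate the jars shipped with Hadoop and the directory holding the Spark jars
sudo find / -name "hadoop-aws*.jar" 2>/dev/null
sudo find / -name "aws-java-sdk*.jar" 2>/dev/null
sudo find / -name "spark-core*.jar" 2>/dev/null

# then copy the two AWS jars next to the Spark jars (paths are illustrative)
sudo cp /path/to/hadoop-aws-2.7.3.jar /usr/lib/spark/jars/
sudo cp /path/to/aws-java-sdk-1.7.4.jar /usr/lib/spark/jars/
```

Step 3, entries in spark-defaults.conf (again with illustrative paths):

```
spark.driver.extraClassPath     /usr/lib/spark/jars/hadoop-aws-2.7.3.jar:/usr/lib/spark/jars/aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath   /usr/lib/spark/jars/hadoop-aws-2.7.3.jar:/usr/lib/spark/jars/aws-java-sdk-1.7.4.jar
```

Step 4, passing the jars at submit time instead:

```
spark-submit \
  --driver-class-path /usr/lib/spark/jars/hadoop-aws-2.7.3.jar:/usr/lib/spark/jars/aws-java-sdk-1.7.4.jar \
  --jars /usr/lib/spark/jars/hadoop-aws-2.7.3.jar,/usr/lib/spark/jars/aws-java-sdk-1.7.4.jar \
  your-application.jar
```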
I got it working using the Spark 1.4.1 prebuilt binary with Hadoop 2.6. Make sure you set both spark.driver.extraClassPath and spark.executor.extraClassPath to point to the two jars (hadoop-aws and aws-java-sdk). If you run on a cluster, make sure your executors have access to the jar files on the cluster.

Here are the details as of October 2016, as presented at Spark Summit EU: Apache Spark and Object Stores.
Key points
Product placement: the read-performance side of HADOOP-11694 is included in HDP 2.5; the Spark and S3 documentation there might be of interest, especially the tuning options.
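(The tuning options themselves aren't listed here; purely as an illustration, these are a few of the standard fs.s3a.* knobs that this kind of tuning usually touches, with placeholder values rather than recommendations from the talk.)

```
spark.hadoop.fs.s3a.connection.maximum   100
spark.hadoop.fs.s3a.attempts.maximum     20
spark.hadoop.fs.s3a.multipart.size       104857600
```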
I am using Spark version 2.3, and when I save a dataset using Spark like the following:
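(A minimal sketch of that kind of write; the bucket, path, and output format here are illustrative placeholders, not the answer's original code.)

```
// Sketch only: assumes the hadoop-aws / aws-java-sdk jars are on the classpath and
// that S3 credentials are configured (e.g. spark.hadoop.fs.s3a.access.key / .secret.key).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-write-example")
  .getOrCreate()

val ds = spark.range(1000)                       // any Dataset or DataFrame works here
ds.write
  .mode("overwrite")
  .parquet("s3a://my-bucket/some/output/path")   // bucket and path are placeholders
```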
It works perfectly and saves my data into s3.
As you said, Hadoop 2.6 doesn't support s3a, and the latest Spark release, 1.6.1, doesn't support Hadoop 2.7, but Spark 2.0 is definitely no problem with Hadoop 2.7 and s3a.

For Spark 1.6.x, we made a dirty hack with the S3 driver from EMR... you can take a look at this doc: https://github.com/zalando/spark-appliance#emrfs-support

If you still want to try to use s3a in Spark 1.6.x, refer to the answer here: https://stackoverflow.com/a/37487407/5630352
Having experienced first hand the difference between s3a and s3n (7.9 GB of data transferred over s3a took around 7 minutes, while the same 7.9 GB over s3n took 73 minutes, us-east-1 to us-west-1 in both cases unfortunately, with Redshift and Lambda being in us-east-1 at the time), this is a very important piece of the stack to get right and it's worth the frustration.
Here are the key parts, as of December 2015:
- Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
- You'll need to include what may at first seem to be an out-of-date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell, using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
- You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
- In spark.properties you probably want some settings that look like this (see the sketch below):
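(A minimal sketch of the kind of entries meant here; the property names are the standard S3A settings, but the values are placeholders rather than the original answer's.)

```
spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key  YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key  YOUR_SECRET_KEY
```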
I've covered this list in more detail in a post I wrote as I worked my way through this process. In addition, I've described all the exception cases I hit along the way, what I believe to be the cause of each, and how to fix them.