Trying to read a file located in S3 using spark-shell:
scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12
scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
... etc ...
The IOException: No FileSystem for scheme: s3n error occurred with:
- Spark 1.3.1 or 1.4.0 on a dev machine (no Hadoop libs)
- Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
- Using the s3:// or s3n:// scheme
What is the cause of this error: a missing dependency, a missing configuration, or misuse of sc.textFile()?
Or maybe this is due to a bug in the Spark build specific to Hadoop 2.6.0, as this post seems to suggest. I am going to try the Spark build for Hadoop 2.4.0 to see if this solves the issue.
I was facing the same issue. It worked fine after setting the value for fs.s3n.impl and adding the hadoop-aws dependency.
Here is sample Spark code that can read files on S3.
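A minimal sketch, assuming Spark 1.x with the hadoop-aws jar (and its AWS SDK dependency) on the classpath; the bucket, file, and credential values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("S3ReadExample")
val sc = new SparkContext(conf)

// Register the s3n filesystem implementation and supply AWS credentials.
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")     // placeholder
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY") // placeholder

// Read the file and count its lines.
val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
println(myRdd.count())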
I had to copy the jar files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work.
Details: the jars come from
(hadoop dir)/share/hadoop/tools/lib/
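For example (a sketch; the Hadoop path and jar versions are illustrative):

# Copy the AWS connector jars from a Hadoop download into Spark's jar directory.
cp /path/to/hadoop/share/hadoop/tools/lib/hadoop-aws-*.jar "$SPARK_HOME/jars/"
cp /path/to/hadoop/share/hadoop/tools/lib/aws-java-sdk-*.jar "$SPARK_HOME/jars/"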
Use s3a instead of s3n. I had a similar issue on a Hadoop job; after switching from s3n to s3a it worked.
e.g.
s3a://myBucket/myFile1.log
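A sketch of the switch, assuming the s3a connector (hadoop-aws plus the AWS SDK) is on the classpath; the credential values are placeholders:

// Configure s3a credentials on the existing SparkContext.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY") // placeholder
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY") // placeholder

val myRdd = sc.textFile("s3a://myBucket/myFile1.log")
myRdd.count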
You probably have to use the s3a:// scheme instead of s3:// or s3n://. However, it is not working out of the box for me in the spark-shell; I hit a stacktrace.
What I think is that you have to manually add the hadoop-aws dependency (http://search.maven.org/#artifactdetails|org.apache.hadoop|hadoop-aws|2.7.1|jar), but I have no idea how to add it to spark-shell properly.
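One approach that may work (a sketch, not verified here) is the --packages flag when launching spark-shell, which pulls hadoop-aws and its transitive dependencies from Maven Central:

spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1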
Confirmed that this is related to the Spark build against Hadoop 2.6.0. Just installed Spark 1.4.0 "Pre-built for Hadoop 2.4 and later" (instead of Hadoop 2.6), and the scheme is now resolved. However,
sc.textFile("s3n://bucketname/Filename")
now raises another error, complaining that the AWS credentials are not set. The code below uses the S3N URL format with inline credentials to show that Spark can read an S3 file, using a dev machine (no Hadoop libs).
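A sketch with placeholder credentials, bucket, and file names:

scala> val myRdd = sc.textFile("s3n://MyAccessKeyID:MySecretKey@myBucket/myFile1.log")
scala> myRdd.count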
Even better: the code above, with AWS credentials inline in the S3N URI, will break if the AWS Secret Key contains a forward slash "/". Configuring the AWS credentials on the SparkContext (see the sketch below) fixes it, and the code works whether the S3 file is public or private.
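A sketch of the SparkContext-based configuration; the credential values, bucket, and file names are placeholders:

scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
scala> myRdd.count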