I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster.
I have added the AWS credentials in core-site.xml:
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>some id</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>some id</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>some key</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>some key</value>
</property>
Note: Since there are some slashes in the key, I have escaped them with %2F.
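For example (this is not my real key, just an illustration), a secret key like abc/def would end up in the file as:
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <!-- hypothetical key abc/def with the slash escaped as %2F -->
  <value>abc%2Fdef</value>
</property>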
If I try to list the contents of the bucket:
hadoop fs -ls s3://some-url/bucket/
I get this error:
ls: No FileSystem for scheme: s3
I edited core-site.xml again, and added information related to the fs:
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
This time I get a different error:
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
Somehow I suspect the Yarn distribution does not have the necessary jars to be able to read S3, but I have no idea where to get those. Any pointers in this direction would be greatly appreciated.
If you are using HDP 2.x or greater, you can try modifying the following property in the MapReduce2 configuration settings in Ambari:
mapreduce.application.classpath
Append the following value to the end of the existing string:
/usr/hdp/${hdp.version}/hadoop-mapreduce/*
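For illustration only (the existing value differs from cluster to cluster, so treat this as a sketch), the property ends up in the form:
<existing classpath entries>:/usr/hdp/${hdp.version}/hadoop-mapreduce/*
Presumably this works because the hadoop-aws jar ships in that directory on HDP.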
@Ashrith's answer worked for me with one modification: I had to use $HADOOP_PREFIX rather than $HADOOP_HOME when running v2.6 on Ubuntu. Perhaps this is because it sounds like $HADOOP_HOME is being deprecated?
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP_PREFIX}/share/hadoop/tools/lib/*
Having said that, neither worked for me on my Mac with v2.6 installed via Homebrew. In that case, I'm using this extremely cludgy export:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(brew --prefix hadoop)/libexec/share/hadoop/tools/lib/*
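Either way, a quick sanity check that the tools directory actually made it onto Hadoop's classpath (assuming a POSIX shell) is something like:
# print Hadoop's effective classpath, one entry per line, and look for the tools/lib entry
hadoop classpath | tr ':' '\n' | grep tools/lib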
To resolve this issue I tried all the above, which failed (for my environment anyway).
However, I was able to get it working by copying the two jars mentioned above from the tools dir into common/lib.
Worked fine after that.
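For reference, the copy looks roughly like this (jar versions vary by release, and I'm assuming the two jars in question are hadoop-aws and the bundled aws-java-sdk):
# copy the S3 filesystem jar and its AWS SDK dependency from the tools dir into common/lib
# (paths assume a standard Hadoop tarball layout under $HADOOP_HOME)
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-*.jar $HADOOP_HOME/share/hadoop/common/lib/
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-*.jar $HADOOP_HOME/share/hadoop/common/lib/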
For some reason, the jar hadoop-aws-[version].jar, which contains the implementation of NativeS3FileSystem, is not present in the classpath of Hadoop by default in versions 2.6 & 2.7. So, try to add it to the classpath by adding the following line to hadoop-env.sh, which is located in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
By the way, you could check the classpath of Hadoop using:
hadoop classpath
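Once that jar shows up in the classpath output, the listing from the question should go through the native S3 filesystem again, e.g. (assuming the s3n:// scheme backed by NativeS3FileSystem):
hadoop fs -ls s3n://some-url/bucket/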