I have a MapReduce job that gets its input from DocumentDB. I've added the jar files under the lib directory in my source code and also used the -libjars option when running the job, but I still get a class-not-found error for a class in one of the jar files. Here is part of my driver program:
public class MapReduceDriver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MapReduceDriver(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();
        ....
When using -libjars, I tried putting the required jar files on the local drive and, separately, on HDFS, but neither worked. How can I make sure that -libjars works?
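For reference, -libjars expects a single comma-separated list of jars, and because it is a generic option handled by ToolRunner, it has to come before the application's own arguments. The following is a minimal sketch, assuming the job is launched directly with `hadoop jar` on a cluster node. The class name LibJarsArgs, the jar name myjob.jar and the in/out arguments are hypothetical; it only assembles the command line and does not run Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: assembles a `hadoop jar` command line.
class LibJarsArgs {

    // Joins every *.jar in dir into the comma-separated value that -libjars expects.
    static String libJarsValue(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.joining(","));
        }
    }

    // Builds: hadoop jar <jobJar> <mainClass> -libjars <jars> <appArgs...>
    // The generic option is placed before the application arguments.
    static List<String> command(String jobJar, String mainClass, String libJars, String... appArgs) {
        List<String> cmd = new ArrayList<>(
                Arrays.asList("hadoop", "jar", jobJar, mainClass, "-libjars", libJars));
        cmd.addAll(Arrays.asList(appArgs));
        return cmd;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: directory holding the dependency jars (defaults to ./lib).
        Path dir = Paths.get(args.length > 0 ? args[0] : "lib");
        // "myjob.jar", "in" and "out" are placeholder values for illustration only.
        List<String> cmd = command("myjob.jar", "MapReduceDriver", libJarsValue(dir), "in", "out");
        System.out.println(String.join(" ", cmd));
    }
}
```

Printing the command instead of executing it makes it easy to compare against what is actually being submitted.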
P.S. I'm using a 2-node HDInsight cluster (running in Microsoft Azure).
Here is the error message I get:
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.microsoft.azure.documentdb.hadoop.DocumentDBInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1961)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.ClassNotFoundException: Class com.microsoft.azure.documentdb.hadoop.DocumentDBInputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1867)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1959)
... 8 more
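The trace comes from YarnChild, so the class is missing in the map task's JVM rather than only on the client. One way to narrow this down is to ask a JVM where (or whether) it can load the class. The following is a rough diagnostic sketch using only the JDK; ClasspathCheck is a hypothetical helper name, and it reports on whichever JVM it runs in, so to learn about the task side it would have to be called from inside the task (for example from a mapper's setup code).

```java
import java.security.CodeSource;
import java.util.Optional;

// Hypothetical diagnostic helper, not part of Hadoop or the DocumentDB connector.
class ClasspathCheck {

    // Returns the jar or directory the class was loaded from, "(bootstrap)" for
    // JDK classes, or an empty Optional if this JVM cannot load the class.
    static Optional<String> locate(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathCheck.class.getClassLoader();
        }
        try {
            // initialize=false: resolve the class without running static initializers.
            Class<?> cls = Class.forName(className, false, loader);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return Optional.of(source == null ? "(bootstrap)" : String.valueOf(source.getLocation()));
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "com.microsoft.azure.documentdb.hadoop.DocumentDBInputFormat";
        System.out.println(name + " -> " + locate(name).orElse("NOT FOUND"));
    }
}
```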
HDInsight uses Templeton, which doesn't support -libjars, so you can't use that option (see the Templeton docs).
Also, I'm assuming you are building a custom HDInsight cluster using a PowerShell script. You can copy all the jars, along with their dependencies, to HADOOP_HOME + '\share\hadoop\common\lib', which is the Hadoop lib folder.
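As an illustration of that copy step, here is a small sketch in plain Java (JDK only). CopyJarsToHadoopLib is a hypothetical name, the source directory is whatever folder holds your jars, and the sketch assumes HADOOP_HOME is set and that the lib folder already exists. It only handles the machine it runs on.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies dependency jars into the Hadoop common lib folder.
class CopyJarsToHadoopLib {

    // Copies every *.jar in sourceDir to <hadoopHome>/share/hadoop/common/lib
    // and returns how many files were copied.
    static int copyJars(Path sourceDir, Path hadoopHome) throws IOException {
        Path libDir = hadoopHome.resolve(Paths.get("share", "hadoop", "common", "lib"));
        if (!Files.isDirectory(libDir)) {
            // Refuse to create the folder: a missing lib dir usually means a wrong HADOOP_HOME.
            throw new NoSuchFileException(libDir.toString());
        }
        int copied = 0;
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(sourceDir, "*.jar")) {
            for (Path jar : jars) {
                Files.copy(jar, libDir.resolve(jar.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        String home = System.getenv("HADOOP_HOME");
        if (home == null || args.length < 1) {
            System.err.println("Usage: set HADOOP_HOME, then pass the directory containing the jars");
            return;
        }
        System.out.println("Copied " + copyJars(Paths.get(args[0]), Paths.get(home)) + " jar(s)");
    }
}
```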
Or you can use the published PowerShell script directly, changing the path to the one that contains the dependency jars (add your jars to an Azure blob container and just replace the path).
I assume you are referring to the DocumentDB Hadoop connector jar found here: https://github.com/Azure/azure-documentdb-hadoop
The jar does not include its dependencies. You can either have Maven retrieve them for you, or download them manually and add them to the build path yourself.
Here are the dependencies: