How to add an external jar to a Hadoop job?

Posted 2019-01-15 16:25

I have a Hadoop job in which the mapper must use an external jar.

I tried to pass this jar to the mapper's JVM in three ways:

via the -libjars argument on the hadoop command:

hadoop jar mrrunner.jar DAGMRRunner -libjars <path_to_jar>/colt.jar

via job.addFileToClassPath:

job.addFileToClassPath(new Path("<path_to_jar>/colt.jar"));

and via HADOOP_CLASSPATH:

g1mihai@hydra:/home/g1mihai/$ echo $HADOOP_CLASSPATH
<path_to_jar>/colt.jar

None of these methods work. This is the stack trace I get back; the missing class it complains about, SparseDoubleMatrix1D, is in colt.jar.

Let me know if I should provide any additional debug info. Thanks.

15/02/14 16:47:51 INFO mapred.MapTask: Starting flush of map output
15/02/14 16:47:51 INFO mapred.LocalJobRunner: map task executor complete.
15/02/14 16:47:51 WARN mapred.LocalJobRunner: job_local368086771_0001
java.lang.Exception: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
        at java.lang.Class.getDeclaredFields0(Native Method)
        at java.lang.Class.privateGetDeclaredFields(Class.java:2499)
        at java.lang.Class.getDeclaredField(Class.java:1951)
        at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
        at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at BoostConnector.ConnectCalculateBoost(BoostConnector.java:39)
        at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:46)
        at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:22)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: cern.colt.matrix.impl.SparseDoubleMatrix1D
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 28 more

2 Answers
唯我独甜 · 2019-01-15 17:18

Use the distributed cache: you can put executable files or small reference files in the cache and use them in your MR job.

https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html
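
As a rough sketch of that approach (the class name CacheJarDriver and the HDFS path /libs/colt.jar below are placeholders, and the jar must already be on HDFS, e.g. uploaded with hadoop fs -put), a driver could add the jar to the task classpath via the DistributedCache API from the linked docs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheJarDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ship the jar to every task node and add it to the task classpath.
    // Placeholder HDFS path; point it at wherever colt.jar was uploaded.
    DistributedCache.addFileToClassPath(new Path("/libs/colt.jar"), conf);

    Job job = new Job(conf, "job with external jar");
    // ... set mapper class, input and output paths as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}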

There are two ways of running an MR job: pass the driver class name on the command line, or set the main class in the jar's manifest when exporting the jar (a sketch of the second approach follows the two commands below).

hadoop jar jarname.jar DriverClassName Input-Location Output-Location
hadoop jar jarname.jar Input-Location Output-Location
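
For the second form, the jar's manifest has to name the main class when the jar is exported. One way to do that (the classes/ directory below is a placeholder for your compiled classes) is the jar tool's -e flag:

jar cfe jarname.jar DriverClassName -C classes/ .
hadoop jar jarname.jar Input-Location Output-Location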
看我几分像从前 · 2019-01-15 17:23

I believe this question deserves a detailed answer; I was stuck on this yesterday and wasted a lot of time. I hope this answer helps everyone who happens to run into it. There are a couple of options to fix this issue:

  1. Include the external jar (dependency JAR) as part of your application jar file. You can easily do this using Eclipse. The disadvantage of this option is that it bloats your application jar and your MapReduce job takes much longer to run. Every time your dependency version changes you will have to recompile the application, etc. It's better not to go this route.

  2. Use "hadoop classpath": on the command line, run the command "hadoop classpath", find a suitable folder, and copy your jar file to that location; Hadoop will pick up the dependency from there. This won't work on Cloudera etc., as you may not have read/write permission to copy files to the Hadoop classpath folders.

  3. The option I made use of was specifying -libjars with the hadoop jar command. First make sure that you edit your driver class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class myDriverClass extends Configured implements Tool {

      public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options (-libjars, -files, -D ...)
        // before handing the remaining arguments to run().
        int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
        System.exit(res);
      }

      public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner / GenericOptionsParser
        Configuration conf = getConf();
        Job job = new Job(conf, "My Job");

        ...
        ...

        return job.waitForCompletion(true) ? 0 : 1;
      }
    }
    

Now edit your "hadoop jar" command as shown below. Note that the generic options such as -libjars must come before your application-specific arguments:

hadoop jar YourApplication.jar myDriverClass -libjars path/to/jar/file args
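
Applied to the job in the question, and assuming DAGMRRunner is reworked to implement Tool as above, the original command should then pick colt.jar up:

hadoop jar mrrunner.jar DAGMRRunner -libjars <path_to_jar>/colt.jar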

Now let's understand what happens underneath. Basically, we are handling the new command-line arguments by implementing the Tool interface. ToolRunner is used to run classes implementing the Tool interface; it works in conjunction with GenericOptionsParser to parse the generic Hadoop command-line arguments and modify the Configuration of the Tool.

Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args) - this runs the given Tool via Tool.run(String[]) after parsing the given generic arguments. It uses the given Configuration, or builds one if it is null, and then sets the Tool's configuration to the possibly modified version of the conf.

Now, within the run method, calling getConf() gives us that modified version of the Configuration. So make sure that you have the line below in your code. If you implement everything else but still use Configuration conf = new Configuration(), nothing will work.

Configuration conf = getConf();