Mapper class not found

Published 2019-03-17 05:36

Question:

Sometimes my MR job complains that the MyMapper class is not found, and I have to call job.setJarByClass(MyMapper.class); to tell it to load the class from my jar file.

cloudera@cloudera-vm:/tmp/translator$ hadoop jar MapReduceJobs.jar translator/input/Portuguese.txt translator/output
13/06/13 03:36:57 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/13 03:36:57 INFO input.FileInputFormat: Total input paths to process : 1
13/06/13 03:36:57 INFO mapred.JobClient: Running job: job_201305100422_0043
13/06/13 03:36:58 INFO mapred.JobClient:  map 0% reduce 0%
13/06/13 03:37:03 INFO mapred.JobClient: Task Id : attempt_201305100422_0043_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.mapreduce.variousformats.keyvaluetextinputformat.MyMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:212)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:601)

Question: Why does this happen? Why doesn't it always tell me to load the class from my jar file? Are there any best practices for tackling this kind of issue? Also, if I am using some third-party libraries, will I have to do this for them as well?

Answer 1:

Be sure to add any dependencies to both the HADOOP_CLASSPATH and -libjars when submitting a job, as in the following examples:

Use the following to add all the jar dependencies from (for example) the current and lib directories:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`echo *.jar`:`echo lib/*.jar | sed 's/ /:/g'`

Bear in mind that when starting a job through hadoop jar you'll also need to pass it the jars of any dependencies via -libjars. I like to use:

hadoop jar <jar> <class> -libjars `echo ./lib/*.jar | sed 's/ /,/g'` [args...]

NOTE: The sed commands require different delimiter characters; HADOOP_CLASSPATH is :-separated, while the -libjars list needs to be ,-separated.
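Note also that -libjars is only honored when the driver parses Hadoop's generic options, which happens automatically if your main class implements Tool and is launched through ToolRunner. Below is a minimal driver sketch under that assumption; the class name TranslatorDriver and the Text output types are my own placeholders, while MyMapper is the mapper class from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TranslatorDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that GenericOptionsParser
        // already populated, so -libjars has been applied to it.
        Job job = new Job(getConf(), "translator");
        job.setJarByClass(TranslatorDriver.class); // ship the jar containing this class
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);   // placeholder types; use your job's
        job.setOutputValueClass(Text.class); // actual key/value classes
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-libjars, -files, -D ...)
        // before handing the remaining args to run().
        System.exit(ToolRunner.run(new Configuration(), new TranslatorDriver(), args));
    }
}

With a driver like this, the -libjars flag from the hadoop jar invocation above is consumed by ToolRunner, so only the input and output paths reach run() as args[0] and args[1].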



Answer 2:

Yes, job.setJarByClass is necessary so that Hadoop will copy your jar to the task trackers. If you do not invoke job.setJarByClass, Hadoop assumes your jar is already on the classpath of the task trackers, and so it does not copy it.
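For context, the WARN line in the question ("See JobConf(Class) or JobConf#setJar(String)") points at the two ways of naming the job jar. A small sketch of both; the class JarSettingExamples and the jar path are placeholders of mine, and MyMapper is the question's mapper:

import com.mapreduce.variousformats.keyvaluetextinputformat.MyMapper;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

public class JarSettingExamples {
    // New (mapreduce) API: infer the job jar from a class it contains.
    // Hadoop finds the jar on the client classpath that holds MyMapper
    // and ships that jar to the task trackers.
    static void newApi(Job job) {
        job.setJarByClass(MyMapper.class);
    }

    // Old (mapred) API: the two alternatives named in the WARN line.
    static JobConf oldApi() {
        JobConf conf = new JobConf(JarSettingExamples.class); // JobConf(Class) infers the jar
        conf.setJar("/path/to/MapReduceJobs.jar");            // JobConf#setJar names it explicitly (placeholder path)
        return conf;
    }
}

Either way, the point is the same: the client has to be told which jar holds your user classes, or the task trackers will fail with the ClassNotFoundException shown above.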