setJarByClass() in Hadoop

Published 2020-03-26 07:40

Question:

At some point in the driver method of a Hadoop job, we point the job at the classes to be used as Mapper and Reducer. For example:

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

Usually the driver method is the main method, while the mapper and reducer are implemented as static inner classes.

Suppose that MyMapper.class and MyReducer.class are static inner classes of MyClass.class, and that the driver method is the main method of MyClass.class. Sometimes I see the following line added right after the two above:

        job.setJarByClass(MyClass.class);

What is the meaning of this configuration step, and when is it useful or mandatory?

In my case (a single-node cluster installation), if I remove this line, the job still runs correctly. Why?

Answer 1:

This is how we tell Hadoop which jar it should ship to the nodes that will run the Map and Reduce tasks. Our abc-jar.jar might have various other jars on its classpath, and our driver code might live in a different jar or location than our Mapper and Reducer classes.

Hence, with setJarByClass we tell Hadoop to find the relevant jar by locating the jar that contains the class passed as the parameter. So you should usually pass your Mapper implementation, your Reducer implementation, or any other class that lives in the same jar as the Mapper and Reducer. Also make sure that both the Mapper and the Reducer are part of the same jar.
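To see the idea behind this, here is a minimal plain-Java sketch (not the Hadoop API itself; the class name JarLocator is hypothetical) of how a class's containing jar can be located through its classloader, which is essentially the lookup setJarByClass relies on:

```java
// Illustration only: resolve the location a class was loaded from.
// When the class comes from a jar, the URL looks like
// "jar:file:/path/to/abc-jar.jar!/MyClass.class".
public class JarLocator {

    static String locate(Class<?> clazz) {
        // Turn the class name into a classloader resource path,
        // e.g. "com/example/MyClass.class".
        String resource = clazz.getName().replace('.', '/') + ".class";
        java.net.URL url = clazz.getClassLoader().getResource(resource);
        return url == null ? null : url.toString();
    }

    public static void main(String[] args) {
        // Prints where JarLocator itself was loaded from: a "jar:file:..." URL
        // if packaged in a jar, or a plain "file:..." URL from a class directory.
        System.out.println(locate(JarLocator.class));
    }
}
```

Because the lookup goes through the class you pass in, any class is fine as long as it sits in the same jar as your Mapper and Reducer. This also explains the single-node case from the question: when everything runs in one JVM with the classes already on the local classpath, no jar needs to be shipped, so omitting setJarByClass can still work.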

Ref: http://www.bigdataspeak.com/2014/06/what-is-need-to-use-jobsetjarbyclass-in.html



Tags: java hadoop jar