I am trying to run two independent mappers on the same input file in a Hadoop program, using a single job, with the output of both mappers going to a single reducer. I am using the MultipleInputs class. It was working fine, running both mappers, but yesterday I noticed that only one map function runs: the second MultipleInputs statement seems to overwrite the first one. I can't find any change to the code that would explain this sudden difference in behavior. Please help me with this. The main function is:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "mapper accepting whole file at once");
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    job.setJarByClass(TestMultipleInputs.class);
    job.setMapperClass(Map2.class);
    job.setMapperClass(Map1.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(NLinesInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapOutputKeyClass(IntWritable.class);
    MultipleInputs.addInputPath(job, new Path("Rec"), NLinesInputFormat.class, Map1.class);
    MultipleInputs.addInputPath(job, new Path("Rec"), NLinesInputFormat.class, Map2.class);
    FileOutputFormat.setOutputPath(job, new Path("testMulinput"));
    job.waitForCompletion(true);
}
Whichever Map class appears in the last MultipleInputs statement is the one that gets executed; here, only Map2.class runs.
Both mappers can't read the same input file at the same time within one job.
Solution (workaround): create a duplicate of the input file (here, let the duplicate of Rec be Rec1), then feed Map1 with Rec and Map2 with Rec1.
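A minimal sketch of the driver with the duplicated input (untested; class names such as TestMultipleInputs, NLinesInputFormat, Map1, Map2, and Reduce are taken from the question, and Rec1 is assumed to be the duplicated copy of Rec):
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "two mappers, one reducer");
    job.setJarByClass(TestMultipleInputs.class);
    // No setMapperClass or setInputFormatClass here: MultipleInputs
    // registers an input format and a mapper per input path, and each
    // path must be distinct or the later registration wins.
    MultipleInputs.addInputPath(job, new Path("Rec"), NLinesInputFormat.class, Map1.class);
    MultipleInputs.addInputPath(job, new Path("Rec1"), NLinesInputFormat.class, Map2.class);
    job.setReducerClass(Reduce.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("testMulinput"));
    job.waitForCompletion(true);
}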
Both mappers are executed in parallel, and you don't need to worry about the reducer side: the outputs of both mappers are shuffled so that equal keys from both files go to the same reducer, so the output is what you want.
Hope this helps others who are facing a similar issue.
You won't be able to read from the same file at the same time with two separate `Mapper`s (at least not without some devilishly hackish trickery, which you should probably avoid). In any case, you can't have two `Mapper` classes set for the same job: the latter call to `setMapperClass(Class)` will always overwrite the former. If you need two `Mapper`s to run simultaneously, you'll need to make two separate jobs and ensure that there are enough map slots available on your cluster to run them both at once (if there aren't any available after the first job starts, the second job will have to wait for it to finish, running sequentially rather than simultaneously). However, since there is no guarantee that the `Mapper`s will run concurrently, make sure the functionality of your MapReduce jobs does not rely on their concurrent execution.
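For reference, a rough sketch of the two-job approach (my own illustration, not tested; `out1` and `out2` are hypothetical output directories, which must differ, and note you give up the single shared reducer, so a follow-up job would be needed if you still want the outputs merged):
Configuration conf = new Configuration();

Job job1 = new Job(conf, "pass with Map1");
job1.setJarByClass(TestMultipleInputs.class);
job1.setMapperClass(Map1.class);
job1.setReducerClass(Reduce.class);
job1.setOutputKeyClass(IntWritable.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job1, new Path("Rec"));
FileOutputFormat.setOutputPath(job1, new Path("out1"));

Job job2 = new Job(conf, "pass with Map2");
job2.setJarByClass(TestMultipleInputs.class);
job2.setMapperClass(Map2.class);
job2.setReducerClass(Reduce.class);
job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job2, new Path("Rec"));
FileOutputFormat.setOutputPath(job2, new Path("out2"));

// submit() returns immediately, unlike waitForCompletion(), so both
// jobs sit in the cluster's queue at the same time and can run
// concurrently if there is capacity.
job1.submit();
job2.submit();
while (!job1.isComplete() || !job2.isComplete()) {
    Thread.sleep(5000);
}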