Multiple output paths (Java - Hadoop - MapReduce)

Posted 2019-03-31 10:54

Question:

I run two MapReduce jobs, and I want the second job to write its results into two different files, in two different directories. I would like something similar to FileInputFormat.addInputPath(.., multiple input path), but for the output.

I'm completely new to MapReduce, and I am required to write my code for Hadoop 0.21.0. I use context.write(..) in my Reduce step, but I don't see how to control multiple output paths...

Thanks for your time !

Here is the reducer from my first job, to show the only way I know to write output (it goes into a /../part* file). What I would like now is to be able to specify two precise files for different outputs, depending on the key:

public static class NormalizeReducer extends Reducer<LongWritable, NetflixRating, LongWritable, NetflixUser> {
    @Override
    public void reduce(LongWritable key, Iterable<NetflixRating> values, Context context) throws IOException, InterruptedException {
        // Collect all ratings that belong to this user id.
        NetflixUser user = new NetflixUser(key.get());
        for (NetflixRating r : values) {
            // Copy each rating: Hadoop reuses the value instance across iterations.
            user.addRating(new NetflixRating(r));
        }
        user.normalizeRatings();
        user.reduceRatings();
        // Single output: ends up in the job's part* file.
        context.write(key, user);
    }
}

EDIT: I used the method from the last comment, as you suggested, Amar. I don't know yet whether it works, since I have other problems with my HDFS, but before I forget, here are my discoveries for the sake of civilization:

http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

  • MultipleOutputs DOES NOT replace FileOutputFormat. You define one output path with FileOutputFormat, and then you can add many more with MultipleOutputs.
  • addNamedOutput method: the String namedOutput argument is just a descriptive name for the output.
  • The path is actually defined in the write method, through the String baseOutputPath argument.
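
A minimal sketch of how these three points could fit together with the new API (org.apache.hadoop.mapreduce), assuming the Hadoop jars are on the classpath. The class names SplitOutputExample and SplitReducer, the named outputs "even" and "odd", the even/odd split on the key, and the Text value type are all hypothetical placeholders, not the code from my job:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplitOutputExample {

    // Driver side: one regular output directory, plus two named outputs.
    // "even" and "odd" are illustrative names (letters and digits only).
    static void configureOutputs(Job job, Path outputDir) {
        FileOutputFormat.setOutputPath(job, outputDir);
        MultipleOutputs.addNamedOutput(job, "even",
                TextOutputFormat.class, LongWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "odd",
                TextOutputFormat.class, LongWritable.class, Text.class);
    }

    public static class SplitReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> {

        private MultipleOutputs<LongWritable, Text> outputs;

        @Override
        protected void setup(Context context) {
            outputs = new MultipleOutputs<LongWritable, Text>(context);
        }

        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Hypothetical routing rule: choose the destination from the key.
            boolean even = key.get() % 2 == 0;
            for (Text value : values) {
                if (even) {
                    // baseOutputPath "even/part" is relative to the job output directory.
                    outputs.write("even", key, value, "even/part");
                } else {
                    outputs.write("odd", key, value, "odd/part");
                }
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Close the extra record writers so their files are flushed.
            outputs.close();
        }
    }
}
```

In the driver, configureOutputs(job, new Path(...)) would be called before submitting the job. The slash in baseOutputPath is what puts each named output in its own subdirectory of the job output directory.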

Answer 1:

I used the method from the last comment, as you suggested, Amar. I don't know yet whether it works, since I have other problems with my HDFS, but before I forget, here are my discoveries for the sake of civilization:

http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

  • MultipleOutputs DOES NOT replace FileOutputFormat. You define one output path with FileOutputFormat, and then you can add many more with MultipleOutputs.
  • addNamedOutput method: the String namedOutput argument is just a descriptive name for the output.
  • The path is actually defined in the write method, through the String baseOutputPath argument.
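
To make the baseOutputPath point more concrete, here is a small plain-Java sketch (no Hadoop dependency) of where I would expect the files to end up. As far as I understand, the file name for a reducer is built as the base path, then "-r-", then the five-digit reduce task number, under the job output directory. The class OutputNameSketch, the method reducerFile, and the paths "/out", "even/part" and "odd/part" are hypothetical and only illustrate that pattern:

```java
import java.util.Locale;

public class OutputNameSketch {

    // Hypothetical helper mirroring the "<base>-r-<5-digit task number>" naming
    // pattern, resolved against the job output directory.
    public static String reducerFile(String jobOutputDir, String baseOutputPath, int partition) {
        return String.format(Locale.ROOT, "%s/%s-r-%05d", jobOutputDir, baseOutputPath, partition);
    }

    public static void main(String[] args) {
        // Two different base paths give two different directories.
        System.out.println(reducerFile("/out", "even/part", 0)); // /out/even/part-r-00000
        System.out.println(reducerFile("/out", "odd/part", 3));  // /out/odd/part-r-00003
    }
}
```

So giving each kind of record its own baseOutputPath with a directory component is what separates the results into two directories, while FileOutputFormat still defines the single root they both live under.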