Two equal combine keys do not get to the same reducer

Published 2019-05-26 18:55

Question:

I'm making a Hadoop application in Java with the MapReduce framework.

I use only Text keys and values for both input and output. I use a combiner to do an extra step of computations before reducing to the final output.

But I have the problem that equal keys do not end up in the same reduce call. I create and write the key/value pairs like this in the combiner:

public static class Step4Combiner extends Reducer<Text,Text,Text,Text> {
    private static Text key0 = new Text();
    private static Text key1 = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        key0.set("KeyOne");
        key1.set("KeyTwo");
        context.write(key0, new Text("some value"));
        context.write(key1, new Text("some other value"));
    }
}

public static class Step4Reducer extends Reducer<Text,Text,Text,Text> {

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        System.out.print("Key:" + key.toString() + " Value: ");
        String theOutput = "";
        for (Text val : values) {
            System.out.print("," + val);
            theOutput += val + " ";
        }
        System.out.print("\n");

        context.write(key, new Text(theOutput));
    }
}

In main I create the job like this:

Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

Job job4 = new Job(conf, "Step 4");
job4.setJarByClass(Step4.class);

job4.setMapperClass(Step4.Step4Mapper.class);
job4.setCombinerClass(Step4.Step4Combiner.class);
job4.setReducerClass(Step4.Step4Reducer.class);

job4.setInputFormatClass(TextInputFormat.class);
job4.setOutputKeyClass(Text.class);
job4.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job4, new Path(outputPath));
FileOutputFormat.setOutputPath(job4, new Path(finalOutputPath));            

System.exit(job4.waitForCompletion(true) ? 0 : 1);

The output printed to stdout by the reducer is this:

Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value
Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value
Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value

Which makes no sense, since the keys are equal: there should be two reduce calls, each with three of the same values in its Iterable.

Hope you can help me get to the bottom of this :)

Answer 1:

This is most probably because your combiner runs in both the map and reduce phases (a little-known 'feature').

Basically, you are amending the key in the combiner, which may or may not run as map outputs are merged together on the reduce side. After the combiner runs (reduce side), the keys are fed through the grouping comparator to determine which values back the Iterable passed to the reduce method. (I'm skirting around the streaming aspect of the reduce phase here: the Iterable is not backed by a set or list of values; rather, successive calls to iterator().next() keep yielding values for as long as the grouping comparator determines that the current key and the last key are the same.)
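To illustrate the grouping behaviour (a plain-Java sketch of the logic, not Hadoop's actual code): the reduce side only starts a new reduce call when the grouping comparator sees the current key differ from the previous one, so an interleaved key stream produces one group per transition instead of one group per distinct key:

```java
import java.util.Arrays;
import java.util.List;

public class GroupingDemo {
    // Count reduce calls: a new group starts whenever the current key
    // differs from the previous one (what the grouping comparator checks).
    static int countReduceCalls(List<String> keyStream) {
        int groups = 0;
        String prev = null;
        for (String k : keyStream) {
            if (prev == null || !k.equals(prev)) {
                groups++;           // grouping comparator sees a transition
            }
            prev = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Keys as emitted by the key-amending combiner (not re-sorted):
        List<String> interleaved = Arrays.asList(
            "KeyOne", "KeyTwo", "KeyOne", "KeyTwo", "KeyOne", "KeyTwo");
        // Keys as they would look if properly sorted:
        List<String> sorted = Arrays.asList(
            "KeyOne", "KeyOne", "KeyOne", "KeyTwo", "KeyTwo", "KeyTwo");

        System.out.println(countReduceCalls(interleaved)); // 6 reduce calls
        System.out.println(countReduceCalls(sorted));      // 2 reduce calls
    }
}
```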

You can try to detect the current combiner side (map or reduce) by inspecting the Context (there is a Context.getTaskAttempt().isMap() method, but I have some memory of this being problematic too, and there might even be a JIRA ticket about it somewhere).

Bottom line: don't amend the key in the combiner unless you can find a way to bypass this behaviour when the combiner runs on the reduce side.

EDIT: Investigating @Amar's comment, I put together some code (pastebin link) which adds some verbose comparators, combiners, reducers, etc. If you run the job with a single mapper, then no combiner runs in the reduce phase, and the map output is not sorted again because it is already assumed to be sorted.

It is assumed to be sorted because it was sorted prior to being sent into the combiner class, and the keys are assumed to come out untouched, hence still sorted. Remember, a combiner is meant to combine values for a given key.

So with a single map and the given combiner, the reducer sees the keys in KeyOne, KeyTwo, KeyOne, KeyTwo, KeyOne, KeyTwo order. The grouping comparator sees a transition at each boundary, and hence you get 6 calls to the reduce function.

If you use two mappers, the reducer knows it has two sorted segments (one from each map), and so still needs to sort them prior to reducing; but because the number of segments is below a threshold, the sort is done as an inline streaming merge (again, each segment is assumed to be internally sorted). You still get the wrong output with two mappers (10 records output from the reduce phase).

So again, don't amend the key in the combiner, this is not what the combiner is intended for.
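As a sketch of what a well-behaved combiner does (plain Java rather than the Hadoop API; combineValues is a hypothetical helper modelling one reduce() call of a combiner): it folds the values for one key into fewer values, and it emits the incoming key unchanged, so the sort order of the map output is preserved:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // A well-behaved combiner: fold the values for one key into a single
    // value, and emit the *incoming* key untouched. (Hypothetical helper,
    // not the Hadoop API; it models one reduce() call of a combiner.)
    static Map.Entry<String, Integer> combineValues(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;                        // local aggregation, e.g. a partial sum
        }
        return new SimpleEntry<>(key, sum);  // key passes through unchanged
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> out = combineValues("KeyOne", Arrays.asList(1, 2, 3));
        System.out.println(out.getKey() + "=" + out.getValue()); // KeyOne=6
    }
}
```

Because the key passes through untouched, the framework's assumption that combiner output is still sorted continues to hold.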



Answer 2:

Try this in the combiner instead:

context.write(new Text("KeyOne"), new Text("some value"));
context.write(new Text("KeyTwo"), new Text("some other value"));

The only way I can see such a thing happening is if the key0 from one combiner is not found to be equal to the key0 from another. I am not sure how it would behave with keys pointing to the exact same instance (which is what happens when you make the keys static).
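For what it's worth, Hadoop's Text keys are compared by their serialized bytes, not by object identity, so two independently created keys with the same content do compare equal. A minimal plain-Java model of that byte-wise comparison (not the Text class itself):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteCompareDemo {
    // Model of content-based key comparison: like Hadoop's Text,
    // compare the UTF-8 bytes rather than object identity.
    static boolean keysEqual(String a, String b) {
        return Arrays.equals(a.getBytes(StandardCharsets.UTF_8),
                             b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String k1 = new String("KeyOne");      // distinct instances,
        String k2 = new String("KeyOne");      // same content
        System.out.println(k1 == k2);          // false: different objects
        System.out.println(keysEqual(k1, k2)); // true: equal by content
    }
}
```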