What is the use of grouping comparator in hadoop m

Posted 2019-01-08 10:03

I would like to know why a grouping comparator is used in the secondary sort of MapReduce.

According to the secondary sort example in the Definitive Guide:

We want the sort order for keys to be by year (ascending) and then by temperature (descending):

1900 35°C
1900 34°C
1900 34°C
...
1901 36°C
1901 35°C

By setting a partitioner to partition by the year part of the key, we can guarantee that records for the same year go to the same reducer. This still isn’t enough to achieve our goal, however. A partitioner ensures only that one reducer receives all the records for a year; it doesn’t change the fact that the reducer groups by key within the partition.

We would already have written our own partitioner, which would take care of the map output keys going to particular reducer, so why do we still need to group them?

Thanks in advance

4 answers
你好瞎i
#2 · 2019-01-08 10:44

You need to introduce an intermediate key that is a composite of the year and temperature; partition on the natural key (the year) and introduce a comparator that will sort on the entire composite key. You're right that by partitioning on the year you'll get all the data for a year in the same reducer, so the comparator will effectively sort the data for each year by the temperature.
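The sort order described here can be sketched in plain Java without any Hadoop classes. This is only an illustration: the class name `CompositeKeySortSketch` and the `int[]{year, temperature}` key layout are assumptions made for the example, not part of the original answer. In a real job the same comparison logic would live in the sort comparator for the composite key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CompositeKeySortSketch {
    // Hypothetical composite key layout: key[0] = year, key[1] = temperature.
    static final Comparator<int[]> SORT_ORDER = (a, b) -> {
        int byYear = Integer.compare(a[0], b[0]);   // year ascending
        if (byYear != 0) {
            return byYear;
        }
        return Integer.compare(b[1], a[1]);         // temperature descending
    };

    // Returns a sorted copy, leaving the input list untouched.
    static List<int[]> sorted(List<int[]> keys) {
        List<int[]> copy = new ArrayList<>(keys);
        copy.sort(SORT_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<int[]> keys = Arrays.asList(
                new int[]{1901, 35}, new int[]{1900, 34}, new int[]{1900, 35},
                new int[]{1901, 36}, new int[]{1900, 34});
        for (int[] key : sorted(keys)) {
            System.out.println(key[0] + " " + key[1]);
        }
    }
}
```

Run on the sample records from the question, this prints them in the year-ascending, temperature-descending order shown there.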

啃猪蹄的小仙女
#3 · 2019-01-08 10:58

The default partitioner calculates the hash of the key, and keys that have the same hash value (modulo the number of reducers) are sent to the same reducer. If your mapper emits a composite (natural + augment) key and you want all keys with the same natural key to go to the same reducer, you have to implement a custom partitioner.

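    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Uses the newer org.apache.hadoop.mapreduce API, where Partitioner is an abstract class.
    public class SimplePartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text compositeKey, LongWritable value, int numReduceTasks) {
            // Split the key into natural and augment parts; "separator" stands in for the real delimiter
            String naturalKey = compositeKey.toString().split("separator")[0];

            // Clear the sign bit, then map the hash into the range [0, numReduceTasks)
            return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }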

Now, if you want all the rows within a partition that share a natural key to be passed to a single reduce() call, you must also implement a grouping comparator that considers only the natural key.

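    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class SimpleGroupingComparator extends WritableComparator {

        public SimpleGroupingComparator() {
            // Register the key class and let WritableComparator create key instances
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable compositeKey1, WritableComparable compositeKey2) {
            // Compare only the natural part; "separator" stands in for the real delimiter
            String naturalKey1 = compositeKey1.toString().split("separator")[0];
            String naturalKey2 = compositeKey2.toString().split("separator")[0];
            return naturalKey1.compareTo(naturalKey2);
        }
    }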

走好不送
#4 · 2019-01-08 11:02

In support of the chosen answer I add:

Following on from this explanation

**Input**:

    symbol time price
    a      1    10
    a      2    20
    b      3    30

**Map output**: create composite key/value pairs like so:

> symbol-time time-price
>
>**a-1**         1-10
>
>**a-2**         2-20
>
>**b-3**         3-30

The Partitioner: routes the a-1 and a-2 keys to the same reducer even though the keys differ, because it looks only at the symbol. The b-3 key may be routed to a different reducer.
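A minimal sketch of that routing rule, in plain Java with no Hadoop dependency. The class and method names are hypothetical, and the `-` delimiter is taken from the `symbol-time` keys in this example; a real job would put the same arithmetic inside a custom partitioner's `getPartition` method.

```java
class NaturalKeyPartitionSketch {
    // Picks a partition from the symbol part of a "symbol-time" composite key.
    static int partitionFor(String compositeKey, int numReduceTasks) {
        String naturalKey = compositeKey.split("-")[0];
        // Clear the sign bit so the result is never negative, then wrap into range.
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4; // assumed reducer count, for illustration only
        for (String key : new String[]{"a-1", "a-2", "b-3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, reducers));
        }
    }
}
```

Because only the symbol is hashed, `a-1` and `a-2` always land in the same partition, whatever the reducer count. Whether `b-3` lands elsewhere depends on the hash and on how many reducers there are.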

GroupComparator: once the composite key/value pairs arrive at the reducer, the reducer would by default get

>(**a-1**,{1-10})
>
>(**a-2**,{2-20})

as two separate reduce calls, because every composite key is unique.

The group comparator will ensure the reducer gets, in a single reduce method call:

(a-1,{**1-10,2-20**})

The key of the grouped values will be the one that comes first in the group. This can be controlled by the key comparator.
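The grouping step can be modelled in plain Java as a walk over the already sorted pairs that starts a new group whenever the grouping comparator reports a different key. This is a simplified model, not Hadoop's actual implementation; the names `GroupingSketch`, `BY_SYMBOL` and `reduceCalls` are made up for the example, and all three records are assumed to sit in the same partition so that a group boundary is visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    // Grouping comparator that looks only at the symbol part of "symbol-time".
    static final Comparator<String> BY_SYMBOL =
            (k1, k2) -> k1.split("-")[0].compareTo(k2.split("-")[0]);

    // Each returned entry models one reduce() call as "firstKey -> [values]".
    static List<String> reduceCalls(List<String[]> sortedPairs, Comparator<String> grouping) {
        List<String> calls = new ArrayList<>();
        String groupKey = null;
        List<String> values = new ArrayList<>();
        for (String[] pair : sortedPairs) {
            // A key that differs from the current group's key closes that group.
            if (groupKey != null && grouping.compare(groupKey, pair[0]) != 0) {
                calls.add(groupKey + " -> " + values);
                values = new ArrayList<>();
                groupKey = null;
            }
            if (groupKey == null) {
                groupKey = pair[0]; // the first key in sort order represents the group
            }
            values.add(pair[1]);
        }
        if (groupKey != null) {
            calls.add(groupKey + " -> " + values);
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[]{"a-1", "1-10"},
                new String[]{"a-2", "2-20"},
                new String[]{"b-3", "3-30"});
        reduceCalls(pairs, BY_SYMBOL).forEach(System.out::println);
    }
}
```

With `BY_SYMBOL` the model yields two calls, `a-1 -> [1-10, 2-20]` and `b-3 -> [3-30]`. Passing `Comparator.naturalOrder()` instead compares whole keys and yields three calls, one per record.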
SAY GOODBYE
#5 · 2019-01-08 11:03

Let me improve the statement "... take care of the map output keys going to particular reducer".

Reducer instance vs. reduce method: one JVM is created per reduce task, and each of these has a single instance of the Reducer class. This is the Reducer instance (I call it the Reducer from now on). Within each Reducer, the reduce method is called multiple times, depending on the 'key grouping'. Each time reduce is called, 'valuein' holds the map output values grouped by the key you define in the 'grouping comparator'. By default, the grouping comparator uses the entire map output key.

In the example, the map output key is changed to 'year and temperature' to achieve sorting. Unless you define a grouping comparator that uses only the 'year' part of the map output key, you can't make all records of the same year go to the same reduce method call.
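To make the difference concrete, the sketch below counts how many reduce calls a single Reducer would make for the sorted sample records from the question, first grouping on the entire key and then on the year only. It is plain Java under stated assumptions: the `int[]{year, temperature}` key layout and all names are hypothetical, and the counting loop is a simplified model of key grouping rather than Hadoop's own code.

```java
import java.util.Comparator;

class ReduceCallCountSketch {
    // Models the default: grouping on the entire composite key.
    static final Comparator<int[]> FULL_KEY = (a, b) -> {
        int byYear = Integer.compare(a[0], b[0]);
        return byYear != 0 ? byYear : Integer.compare(b[1], a[1]);
    };

    // Models a custom grouping comparator: only the year part counts.
    static final Comparator<int[]> YEAR_ONLY = (a, b) -> Integer.compare(a[0], b[0]);

    // Counts groups by comparing each key with the one before it in sort order.
    static int countReduceCalls(int[][] sortedKeys, Comparator<int[]> grouping) {
        int calls = 0;
        int[] previous = null;
        for (int[] key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                calls++; // a key unequal to its predecessor opens a new reduce call
            }
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        int[][] sortedKeys = {{1900, 35}, {1900, 34}, {1900, 34}, {1901, 36}, {1901, 35}};
        System.out.println("entire key: " + countReduceCalls(sortedKeys, FULL_KEY));
        System.out.println("year only:  " + countReduceCalls(sortedKeys, YEAR_ONLY));
    }
}
```

For these five records the model gives four calls when grouping on the entire key (the two identical `1900 34` keys share one call) and two calls when grouping on the year, one per year.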
