Java Mapreduce sort composite value

2019-07-09 03:34发布

问题:

I have a mapper that emits a Text (fruit name) key and a custom composite value city:count. I want to sort the composite values by the count prior to it arriving to the reducer such that the reducer can quickly determine which city has the highest count.

The composite value class is an extension of WritableComparable and has methods for retrieving the count and city.

What my reducer currently receives:

reducer 1 - oranges:<london:2, chicago:15, charleston:6>
reducer 2 - apples:<charleston:31, london:3, chicago:29>
...

What I want my reducer to receive:

reducer 1 - oranges:<chicago:15, charleston:6, london:2>
reducer 2 - apples:<charleston:31, chicago:29, london:3>

Logically, how do I make this happen? I've read several articles on Secondary Sorting/Ordering, but they tend to focus on composite keys as opposed to composite values. My keys don't need furthering partitioning nor do they need further sorting.

Again, sorting by a composite VALUE not a composite key!

回答1:

If you are only aiming at fast determination of the highest amount of a fruit i'd like to recommend another approach. Since sorting in most cases has a complexity of O(n log n) while finding the biggest entry only has O(n) where n is the number of cities in your case.

1. Mapper with Memory

You can use a hashmap in each mapper to determine the highest amount for each fruit per mapper. Just use fruit as key and city+count as value. When you get a fruit look into the map to compare for the bigger one. If the fruit did not already exist you obviously have to set it. When all map steps are executed the framework calls the cleanup method of your mapper. In the cleanup you can emit the entries of the map. This will reduce the number of values you have to send and go through in the reducer significantly.

2. Combiner

The approach 1. has one significant draw back. It is not scalable if you have a high amount of fruits which didn't fit into the memory. If this is the case you can use a combiner which is executed at mapper side. It works like a reducer for a smaller set of data given by the corresponding mapper. This would also lead to the benefit of a reduced number of values you send to the reducer.

3. Secondary Ordering

You can do it with secondary ordering. I really like to encourage you to read the article provided by Preeti Khurana. Especially the answer of Sudarshan. To give you a brief idea: Use a composite key of fruit:count and the value of city:count. Be aware that you need a special partitioning based on the first part of the key. I think this would be a high amount of effort but in some cases it is useful and necessary.