There is a note about how Cascading/Scalding optimizes map-side evaluation: they use so-called Partial Aggregation. Is it actually a better approach than Combiners? Are there any performance comparisons on common Hadoop tasks (word count, for example)? If so, will Hadoop support this in the future?
It is better for certain types of aggregations, and Cascading aggregations are a bit more flexible as to what can be aggregated. From the Cascading site (emphasis mine):
> In practice, there are more benefits from partial aggregation than from use of combiners.
>
> The cases where combiners are useful are limited. Also, combiners optimize the amount of throughput required by the tasks, not the number of reduces -- that's a subtle distinction which adds up to significant performance deltas.
>
> There is a much broader range of use cases for partial aggregation in large distributed workflows. Also, partial aggregation can be used to optimize the number of job steps required for a workflow.
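To make that distinction concrete, here is a minimal sketch of the idea behind map-side partial aggregation (an illustration only, not Cascading's actual code): the operation keeps a bounded in-memory cache of partial results and flushes entries downstream as the cache fills, instead of relying on the sort/spill machinery that combiners run inside. The class name, the `CAPACITY` threshold, and the `emit` sink below are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of in-mapper partial aggregation for word count.
// Combiners run after map output is serialized and sorted; this
// cache aggregates before any of that, with bounded memory.
public class PartialWordCount {

  private static final int CAPACITY = 10_000; // tunable cache threshold

  // LRU cache of partial counts; when it overflows, the eldest
  // entry is flushed downstream as a partially summed tuple.
  private final Map<String, Long> cache =
      new LinkedHashMap<String, Long>( CAPACITY, 0.75f, true ) {
        @Override
        protected boolean removeEldestEntry( Map.Entry<String, Long> eldest ) {
          if ( size() > CAPACITY ) {
            emit( eldest.getKey(), eldest.getValue() );
            return true; // evict after flushing the partial count
          }
          return false;
        }
      };

  public void map( String token ) {
    cache.merge( token, 1L, Long::sum ); // aggregate in place
  }

  public void close() {
    cache.forEach( this::emit ); // flush remaining partials at task end
    cache.clear();
  }

  void emit( String token, long count ) {
    System.out.println( token + "\t" + count ); // stand-in for a real collector
  }
}
```

The reducers still sum the partial counts, but far fewer tuples cross the network, and no combiner pass over sorted spill files is needed.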
Examples are shown in https://github.com/Cascading/Impatient/wiki/Part-5, which uses the `CountBy` and `SumBy` partial aggregates. If you look back through the commit history on GitHub for that project, there was previously a `GroupBy` with a `Count` aggregator, which resulted in more reduces.
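For reference, that difference in a word-count pipeline looks roughly like this (a sketch against the Cascading 2.x Java API; the `token` and `count` field names follow the Impatient tutorial, and the wrapper class is hypothetical):

```java
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.pipe.assembly.CountBy;
import cascading.tuple.Fields;

public class WordCountPipes {

  // Earlier approach: a full GroupBy plus a Count aggregator.
  // Every raw (token) tuple is shuffled to the reducers before counting.
  static Pipe withGroupBy( Pipe docPipe ) {
    Pipe wcPipe = new GroupBy( docPipe, new Fields( "token" ) );
    return new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
  }

  // Partial-aggregation approach: CountBy keeps a bounded map-side
  // cache and emits partially summed (token, count) pairs instead.
  static Pipe withCountBy( Pipe docPipe ) {
    return new CountBy( docPipe, new Fields( "token" ), new Fields( "count" ) );
  }
}
```

The `CountBy` version shuffles far fewer tuples, since counts are partially summed on the map side before they ever reach the reducers.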