Question:
There are notes about how Cascading/Scalding optimize map-side evaluation using so-called partial aggregation. Is it actually a better approach than combiners? Are there any performance comparisons on common Hadoop tasks (word count, for example)? If so, will Hadoop support this in the future?
Answer 1:
In practice, there are more benefits from partial aggregation than from the use of combiners.
The cases where combiners are useful are limited. Also, combiners optimize the amount of throughput required by the tasks, not the number of reduces -- a subtle distinction that adds up to significant performance deltas.
There is a much broader range of use cases for partial aggregation in large distributed workflows. Also, partial aggregation can be used to optimize the number of job steps required for a workflow.
Examples are shown in https://github.com/Cascading/Impatient/wiki/Part-5, which uses the CountBy and SumBy partial aggregates. If you look back in the commit history on GitHub for that project, it previously used GroupBy and Count, which resulted in more reduces.
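To make the CountBy-versus-GroupBy/Count difference concrete, here is a minimal, self-contained sketch (plain Java, not the Cascading API) of why map-side partial counting shrinks the data crossing the shuffle: a GroupBy/Count plan ships one tuple per word occurrence, while a CountBy-style plan ships one tuple per distinct word per mapper. The class and data are hypothetical, purely for illustration.

```java
import java.util.*;

public class PartialAggDemo {
    // One "mapper" split of words, pre-aggregated map side into
    // (word, partialCount) pairs -- the CountBy-style behavior.
    static Map<String, Integer> partialCounts(List<String> split) {
        Map<String, Integer> cache = new HashMap<>();
        for (String w : split) cache.merge(w, 1, Integer::sum);
        return cache;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
            List.of("a", "b", "a", "a", "c"),
            List.of("b", "b", "a", "c", "c"));

        int naiveShuffled = 0, partialShuffled = 0;
        Map<String, Integer> totals = new TreeMap<>();
        for (List<String> split : splits) {
            // GroupBy/Count: one (word, 1) tuple per occurrence crosses the shuffle.
            naiveShuffled += split.size();
            // CountBy-style: one (word, partialCount) tuple per distinct key.
            Map<String, Integer> partial = partialCounts(split);
            partialShuffled += partial.size();
            // Reduce side combines the partial counts into final totals.
            partial.forEach((w, c) -> totals.merge(w, c, Integer::sum));
        }
        System.out.println("naive=" + naiveShuffled + " partial=" + partialShuffled);
        System.out.println(totals); // final word counts are identical either way
    }
}
```

Both plans produce the same totals; the partial-aggregation plan simply moves less data between the map and reduce sides.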
Answer 2:
It is better for certain types of aggregations. Cascading's aggregations are a bit more flexible as to what can be aggregated. From the Cascading site (emphasis mine):
Cascading does not support the so-called MapReduce Combiners. Combiners are very powerful in that they reduce the IO between the Mappers and Reducers. Why send all of your Mapper data to Reducers when you can compute some values Map side and combine them in the Reducer? But Combiners are limited to Associative and Commutative functions only, like 'sum' and 'max'. And in order to work, values emitted from the Map task must be serialized, sorted (deserialized and compared), deserialized again and operated on, where again the results are serialized and sorted. Combiners trade CPU for gains in IO.
Cascading takes a different approach by providing a mechanism to perform partial aggregations Map side and also combine them Reduce side. But Cascading chooses to trade Memory for IO gains by caching values (up to a threshold). This approach bypasses the unnecessary serialization, deserialization, and sorting steps. It also allows for any aggregate function to be implemented, not just Associative and Commutative ones.
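The "cache values up to a threshold" mechanism described above can be sketched in plain Java (this is illustrative only, not Cascading's actual implementation; the class, field names, and tiny threshold are all assumptions). Keys are summed in an in-memory map, and when the map grows past the threshold its partial results are flushed downstream, trading memory for IO without any intermediate serialization or sorting:

```java
import java.util.*;

public class ThresholdCache {
    // Stand-in for the tuples sent to the reduce side.
    static List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
    // Map-side cache of per-key partial sums.
    static Map<String, Integer> cache = new LinkedHashMap<>();
    static final int THRESHOLD = 2; // tiny, for demonstration only

    static void accept(String key, int value) {
        cache.merge(key, value, Integer::sum);
        // When the cache exceeds the threshold, flush partial results
        // downstream; the reduce side re-combines them later.
        if (cache.size() > THRESHOLD) flush();
    }

    static void flush() {
        for (Map.Entry<String, Integer> e : cache.entrySet())
            emitted.add(Map.entry(e.getKey(), e.getValue()));
        cache.clear();
    }

    public static void main(String[] args) {
        for (String w : List.of("a", "b", "a", "c", "a")) accept(w, 1);
        flush(); // drain whatever is left at end of the map task
        // 5 input tuples became 4 emitted partial tuples.
        System.out.println(emitted);
    }
}
```

Because the cache holds live objects rather than serialized, sorted records, any aggregate function can maintain its own partial state here, not just associative and commutative ones.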