There is a note about how Cascading/Scalding optimizes map-side evaluation: they use so-called Partial Aggregation. Is it actually a better approach than Combiners? Are there any performance comparisons on common Hadoop tasks (word count, for example)? If so, will Hadoop support this in the future?
It is better for certain types of aggregations, and Cascading aggregations are a bit more flexible as to what can be aggregated. From the Cascading site (emphasis mine):
> In practice, there are more benefits from partial aggregation than from use of combiners.
>
> The cases where combiners are useful are limited. Also, combiners optimize the amount of throughput required by the tasks, not the number of reduces -- that's a subtle distinction which adds up to significant performance deltas.
>
> There is a much broader range of use cases for partial aggregation in large distributed workflows. Also, partial aggregation can be used to optimize the number of job steps required for a workflow.
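To make that distinction concrete, here is a minimal sketch of the idea behind map-side partial aggregation (an illustration only, not Cascading's actual code): the operation keeps a bounded in-memory cache of partial results and flushes entries downstream as the cache fills, instead of relying on the sort/spill machinery that combiners run inside. The class name, the `CAPACITY` threshold, and the `emit` sink below are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of in-mapper partial aggregation for word count.
// Combiners run after map output is serialized and sorted; this
// cache aggregates before any of that, with bounded memory.
public class PartialWordCount {

  private static final int CAPACITY = 10_000; // tunable cache threshold

  // LRU cache of partial counts; when it overflows, the eldest
  // entry is flushed downstream as a partially summed tuple.
  private final Map<String, Long> cache =
      new LinkedHashMap<String, Long>( CAPACITY, 0.75f, true ) {
        @Override
        protected boolean removeEldestEntry( Map.Entry<String, Long> eldest ) {
          if ( size() > CAPACITY ) {
            emit( eldest.getKey(), eldest.getValue() );
            return true; // evict after flushing the partial count
          }
          return false;
        }
      };

  public void map( String token ) {
    cache.merge( token, 1L, Long::sum ); // aggregate in place
  }

  public void close() {
    cache.forEach( this::emit ); // flush remaining partials at task end
    cache.clear();
  }

  void emit( String token, long count ) {
    System.out.println( token + "\t" + count ); // stand-in for a real collector
  }
}
```

The reducers still sum the partial counts, but far fewer tuples cross the network, and no combiner pass over sorted spill files is needed.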
Examples are shown in https://github.com/Cascading/Impatient/wiki/Part-5, which uses the `CountBy` and `SumBy` partial aggregates. If you look back through the commit history on GitHub for that project, there was previously a `GroupBy` with a `Count` aggregator, which resulted in more reduces.
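For reference, that difference in a word-count pipeline looks roughly like this (a sketch against the Cascading 2.x Java API; the `token` and `count` field names follow the Impatient tutorial, and the wrapper class is hypothetical):

```java
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.pipe.assembly.CountBy;
import cascading.tuple.Fields;

public class WordCountPipes {

  // Earlier approach: a full GroupBy plus a Count aggregator.
  // Every raw (token) tuple is shuffled to the reducers before counting.
  static Pipe withGroupBy( Pipe docPipe ) {
    Pipe wcPipe = new GroupBy( docPipe, new Fields( "token" ) );
    return new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
  }

  // Partial-aggregation approach: CountBy keeps a bounded map-side
  // cache and emits partially summed (token, count) pairs instead.
  static Pipe withCountBy( Pipe docPipe ) {
    return new CountBy( docPipe, new Fields( "token" ), new Fields( "count" ) );
  }
}
```

The `CountBy` version shuffles far fewer tuples, since counts are partially summed on the map side before they ever reach the reducers.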