These three Apache Spark transformations are a little confusing. Is there any way I can determine when to use which one, and when to avoid one?
Answer 1:
I think the official guide explains it well enough.

I will highlight the differences (assume you have an RDD of type `(K, V)`; a short sketch of all three calls follows the list):

- If you need to keep the values, use `groupByKey`.
- If you do not need to keep the values, but you do need some aggregated info about each group (the items of the original RDD which have the same `K`), you have two choices: `reduceByKey` or `aggregateByKey` (`reduceByKey` is a particular case of `aggregateByKey`).
  - 2.1 If you can provide an operation which takes `(V, V)` as input and returns `V`, so that all the values of a group can be reduced to one single value of the same type, use `reduceByKey`. As a result you will have an RDD of the same `(K, V)` type.
  - 2.2 If you cannot provide that aggregation operation, use `aggregateByKey`. This is the case when you reduce the values to another type, so you will get a `(K, V2)` RDD as a result.
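A minimal sketch of the three calls side by side, assuming a local SparkContext named `sc` and a tiny sample dataset (both are just for illustration, not from the answer above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("byKeyDemo").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// 1. Keep all values per key -> RDD[(String, Iterable[Int])]
val grouped = pairs.groupByKey()            // ("a", [1, 3]), ("b", [2, 4])

// 2.1 Reduce values to one value of the same type -> RDD[(String, Int)]
val summed = pairs.reduceByKey(_ + _)       // ("a", 4), ("b", 6)

// 2.2 Aggregate values into a different type -> RDD[(String, Set[Int])]
val asSets = pairs.aggregateByKey(Set.empty[Int])(
  (set, v) => set + v,                      // fold a value into the per-partition accumulator
  (s1, s2) => s1 ++ s2                      // merge accumulators from different partitions
)
```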
Answer 2:
In addition to @Hlib's answer, I would like to add a few more points.

- `groupByKey()` just groups your dataset based on a key.
- `reduceByKey()` is something like grouping + aggregation; we can say `reduceByKey()` is equivalent to `dataset.group(...).reduce(...)`.
- `aggregateByKey()` is logically the same as `reduceByKey()`, but it lets you return the result in a different type. In other words, it lets you have input of type x and an aggregated result of type y. For example, (1,2), (1,4) as input and (1,"six") as output.
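A minimal sketch of that kind of type change, reusing the hypothetical `sc` from the earlier snippet and using string concatenation as the aggregation (the exact (1,"six") result would additionally need a number-to-word lookup, which is left out here):

```scala
val input = sc.parallelize(Seq((1, 2), (1, 4)))

// Values are Ints, but the aggregated result per key is a String:
// start from "", append each value within a partition, then join the partial strings.
val output = input.aggregateByKey("")(
  (acc, v) => if (acc.isEmpty) v.toString else acc + "," + v,   // seqOp: (String, Int) => String
  (a, b)   => Seq(a, b).filter(_.nonEmpty).mkString(",")        // combOp: merge partial Strings
)
// e.g. (1, "2,4") -- input values of type Int, result values of type String
```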