These three Apache Spark transformations (groupByKey(), reduceByKey(), aggregateByKey()) are a little confusing. Is there any way I can determine when to use which one, and when to avoid one?
In addition to @Hlib's answer, I would like to add a few more points.

groupByKey() just groups your dataset based on a key.

reduceByKey() is something like grouping plus aggregation. You can say reduceByKey() is equivalent to dataset.group(...).reduce(...).

aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregated result of type y. For example, (1,2),(1,4) as input and (1,"six") as output. I think the official guide explains it well enough.
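Here is a minimal sketch of all three on the same toy pair RDD (the data and variable names are invented for the example, and `sc` is assumed to be an existing SparkContext, e.g. inside spark-shell):

```scala
// Assumes a live SparkContext named `sc`; the sample data is made up.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// groupByKey: just grouping, all values are kept.
// Result: ("a", Iterable(1, 2)), ("b", Iterable(3))
val grouped = pairs.groupByKey()

// reduceByKey: grouping + aggregation with (V, V) => V,
// so the value type stays Int. Result: ("a", 3), ("b", 3)
val summed = pairs.reduceByKey(_ + _)

// aggregateByKey: the result type may differ from the input type.
// Here Int values become one String per key, e.g. ("a", "1|2").
val joined = pairs.aggregateByKey("")(
  (acc, v) => if (acc.isEmpty) v.toString else acc + "|" + v,          // fold one value into the accumulator
  (a, b)   => if (a.isEmpty) b else if (b.isEmpty) a else a + "|" + b  // merge two partial accumulators
)
```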
I will highlight the differences (you have an RDD of type (K, V)):

1. If you need to keep the values, then use groupByKey.
2. If you don't need to keep the values, but you need some aggregated information about each group (the records sharing the same key K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is a kind of particular aggregateByKey).
   2.1. If your aggregation requires a function which takes (V, V) as input and returns V, so that all the values of the group can be reduced to one single value of the same type, then use reduceByKey. As a result you will have an RDD of the same (K, V) type.
   2.2. If your aggregation can't be done this way, then you should use aggregateByKey. It happens when you reduce the values to another type. So you will have (K, V2) as a result (see the sketch after this list).
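As a concrete illustration of points 2.1 and 2.2, here is a hedged sketch (the RDD contents and names are invented for the example): taking the maximum per key is a (V, V) => V reduction, so reduceByKey fits, while computing an average per key needs an intermediate (sum, count) pair of a different type, so aggregateByKey is used:

```scala
// Again assumes an existing SparkContext `sc`; data invented for the example.
val rdd = sc.parallelize(Seq(("k1", 10.0), ("k1", 20.0), ("k2", 5.0)))

// 2.1: (V, V) => V, result stays (K, V) -- maximum value per key.
val maxPerKey = rdd.reduceByKey((a, b) => math.max(a, b))

// 2.2: values are reduced to another type V2 = (Double, Long), i.e. (sum, count),
// giving an RDD of type (K, V2), which is then mapped to the average.
val avgPerKey = rdd
  .aggregateByKey((0.0, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),   // fold one value into (sum, count)
    (a, b)   => (a._1 + b._1, a._2 + b._2)  // merge partial (sum, count) pairs
  )
  .mapValues { case (sum, cnt) => sum / cnt }
// avgPerKey: ("k1", 15.0), ("k2", 5.0)
```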