This question is about the duality between DataFrame and RDD when it comes to aggregation operations. In Spark SQL one can use table-generating UDFs for custom aggregations, but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required.
Is there an efficient way to apply pair RDD operations such as aggregateByKey to a DataFrame which has been grouped using GROUP BY or ordered using ORDER BY? Normally, one would need an explicit map step to create key-value tuples, e.g., dataFrame.rdd.map(row => (row.getString(row.fieldIndex("category")), row)).aggregateByKey(...). Can this be avoided?
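For concreteness, here is a minimal sketch of the explicit-map approach described above. It assumes an existing SparkSession named spark; the column names ("category", "value") and the mean aggregation are made up for illustration:

```scala
// Hypothetical example data; assumes a SparkSession `spark` is in scope.
val df = spark.createDataFrame(Seq(
  ("a", 1.0), ("a", 2.0), ("b", 3.0)
)).toDF("category", "value")

// Drop to the underlying RDD, build (key, value) pairs explicitly,
// then aggregate per key (here: sum and count, combined into a mean).
val means = df.rdd
  .map(row => (row.getString(row.fieldIndex("category")),
               row.getDouble(row.fieldIndex("value"))))
  .aggregateByKey((0.0, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into an accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)    // merge two partial accumulators
  )
  .mapValues { case (sum, count) => sum / count }
```

The boilerplate here is exactly the map step the question asks about: the key has to be pulled out of each Row by hand before any pair RDD operation can be applied.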
Not really. While DataFrames can be converted to RDDs and vice versa, this is a relatively complex operation, and methods like DataFrame.groupBy don't have the same semantics as their counterparts on RDDs.

The closest thing you can get is the new Dataset API introduced in Spark 1.6.0. It provides much closer integration with DataFrames and a GroupedDataset class with its own set of methods, including reduce, cogroup, and mapGroups.

In some specific cases it is possible to leverage Orderable semantics to group and process data using structs or arrays. You'll find an example in SPARK DataFrame: select the first row of each group.
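A rough sketch of the Dataset route mentioned above, computing the same per-category mean. Note this is illustrative and version-dependent: in Spark 2.x the class was renamed KeyValueGroupedDataset and typed grouping is done via groupByKey; it assumes a SparkSession named spark with its implicits imported:

```scala
// Assumes a SparkSession `spark`; the Record schema is made up for illustration.
import spark.implicits._

case class Record(category: String, value: Double)

val ds = Seq(Record("a", 1.0), Record("a", 2.0), Record("b", 3.0)).toDS()

// Group by a typed key and fold each group with mapGroups -- roughly
// analogous to aggregateByKey, but without an explicit map to tuples.
val means = ds
  .groupByKey(_.category)
  .mapGroups { (key, records) =>
    val values = records.map(_.value).toSeq
    (key, values.sum / values.size)
  }
```

Unlike the RDD version, the key extraction is a plain function on the typed records, and Spark keeps the result as a Dataset rather than dropping out of the SQL engine entirely.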