I want to rewrite some of my code written with RDDs to use DataFrames. It was working quite smoothly until I found this:
events
  // keep, for every key, the event with the smallest client send timestamp
  .keyBy(row => (row.getServiceId, row.getClientCreateTimestamp, row.getClientId))
  .reduceByKey((e1, e2) => if (e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
  .values
It is simple enough to start with:
events
.groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))
But what's next? What if I'd like to iterate over every element in the current group? Is it even possible? Thanks in advance.
GroupedData cannot be used directly. The data is not physically grouped; grouping is just a logical operation, so you have to apply some variant of the agg method to it.
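For example, something like this takes the earliest client_send_timestamp per group (the DataFrame counterpart of the reduceByKey above):

events
  .groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))
  .min("client_send_timestamp")

or, equivalently, through agg:

import org.apache.spark.sql.functions.min

events
  .groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))
  .agg(min(events("client_send_timestamp")))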
where client_send_timestamp is the column you want to aggregate.

If you want to keep information other than the aggregate, just join the result back, or use window functions (sketched below) - see Find maximum row per group in Spark DataFrame.

Spark also supports User Defined Aggregate Functions - see How to define and use a User-Defined Aggregate Function in Spark SQL?
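A minimal sketch of the window-function route, assuming the column names from your question; unlike the plain aggregation, it keeps every column of the earliest row in each group:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// rank rows within each group by client_send_timestamp
val w = Window
  .partitionBy("service_id", "client_create_timestamp", "client_id")
  .orderBy("client_send_timestamp")

// keep only the earliest row per group, then drop the helper column
events
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")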
Spark 2.0+
You could use Dataset.groupByKey, which exposes each group to you as an iterator (for example through mapGroups).
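A sketch of what that looks like for your query, assuming a hypothetical Event case class whose fields mirror the question's columns and guessing both timestamps are Longs:

case class Event(
  service_id: String,
  client_create_timestamp: Long,
  client_id: String,
  client_send_timestamp: Long)

import spark.implicits._ // assumes a SparkSession in scope named spark

events.as[Event]
  .groupByKey(e => (e.service_id, e.client_create_timestamp, e.client_id))
  // mapGroups hands you the key plus an Iterator over every element in the group
  .mapGroups((_, group) => group.minBy(_.client_send_timestamp))

Keep in mind that groupByKey shuffles and deserializes whole objects and cannot use partial aggregation, so the agg and window variants above will usually perform better.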