What is the best way to return the max row (value) associated with each unique key in a Spark RDD?
I'm using Python and I've tried Math max, mapping and reducing by keys, and aggregates. Is there an efficient way to do this? Possibly a UDF?
My data is in an RDD of this form:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
For ties, returning either the first value or an arbitrary one is fine.
What you actually have is a PairRDD (an RDD of key-value pairs). One of the best ways to do this is with reduceByKey:
(Scala)
(Python)
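For the Python case, a minimal sketch of the reduceByKey approach might look like the following. It assumes a local PySpark installation and that the keys are strings; the helper names `max_per_key` and `demo`, the app name, and the `local[*]` master are illustrative choices, not something taken from the question.

```python
def max_per_key(pair_rdd):
    """Return a pair RDD holding the largest value seen for each key."""
    # reduceByKey merges the values of each key two at a time with the
    # given function; the built-in max keeps the larger of the two.
    return pair_rdd.reduceByKey(max)


def demo():
    """Run the question's sample data through max_per_key (needs PySpark)."""
    from pyspark import SparkConf, SparkContext

    # Hypothetical local setup; on a cluster the master would differ.
    conf = SparkConf().setMaster("local[*]").setAppName("max-per-key")
    sc = SparkContext.getOrCreate(conf)

    rdd = sc.parallelize([
        ("v", 3), ("v", 1), ("v", 1),
        ("w", 7), ("w", 1),
        ("x", 3),
        ("y", 1), ("y", 1), ("y", 2), ("y", 3),
    ])

    # collect() gives no ordering guarantee, so sort before printing.
    print(sorted(max_per_key(rdd).collect()))
    sc.stop()
```

Calling `demo()` should print one pair per key, each carrying that key's maximum. Because `max` returns the same number whichever tied value it picks, ties need no special handling here.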
(Java 7)
(Java 8)
API doc for reduceByKey: