What is the best way to return the max row (value) associated with each unique key in a spark RDD?
I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?
I have in RDD format:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
Ties can return the first value or random.
Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) {
return Math.max(v1, v2);
}
});
(Java 8)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
(v1, v2) -> Math.max(v1, v2)
);
API doc for reduceByKey: