Get the max value for each key in a Spark RDD

Asked 2019-01-18 14:14

What is the best way to return the max row (value) associated with each unique key in a Spark RDD?

I'm using Python, and I've tried Python's max, mapping and reducing by key, and aggregates. Is there an efficient way to do this? Possibly a UDF?

I have an RDD in this format:

[('v', 3),
 ('v', 1),
 ('v', 1),
 ('w', 7),
 ('w', 1),
 ('x', 3),
 ('y', 1),
 ('y', 1),
 ('y', 2),
 ('y', 3)]

And I need to return:

[('v', 3),
 ('w', 7),
 ('x', 3),
 ('y', 3)]

For ties, returning either the first value or a random one is fine.

1 Answer

Answered 2019-01-18 15:06

What you have is actually a pair RDD, and one of the best ways to do this is with reduceByKey:

(Scala)

val grouped = rdd.reduceByKey(math.max(_, _))

(Python)

grouped = rdd.reduceByKey(max)
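
For completeness, here is a minimal end-to-end PySpark sketch against the question's sample data; the local SparkContext setup and variable names are assumptions for illustration:

from pyspark import SparkContext

sc = SparkContext("local", "max-per-key")  # assumed local setup, for illustration only

# The question's sample data, with string keys
rdd = sc.parallelize([
    ("v", 3), ("v", 1), ("v", 1),
    ("w", 7), ("w", 1),
    ("x", 3),
    ("y", 1), ("y", 1), ("y", 2), ("y", 3),
])

# reduceByKey folds the values of each key pairwise with the built-in max
result = rdd.reduceByKey(max).collect()
print(sorted(result))  # [('v', 3), ('w', 7), ('x', 3), ('y', 3)]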

(Java 7)

// assumes rdd is already a JavaPairRDD<String, Integer>
JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
        }
    });

(Java 8)

JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);
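
Since the question asks for the max "row" and allows ties to resolve arbitrarily, the same pattern extends to values that are full records rather than plain numbers: reduce with a comparator that keeps the whole row. A minimal Python sketch, where the (score, payload) value layout is a hypothetical example:

# Hypothetical rows of (key, (score, payload)); keep the row with the highest score.
# On a tie this keeps whichever row the reduction saw first, which the question permits.
rows = sc.parallelize([
    ("v", (3, "a")), ("v", (3, "b")),
    ("w", (7, "c")), ("w", (1, "d")),
])
best = rows.reduceByKey(lambda a, b: a if a[0] >= b[0] else b)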

See the API doc for reduceByKey for more details.
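
Note that the function passed to reduceByKey must be associative and commutative, since Spark merges partial results within each partition before shuffling; max satisfies both. This map-side combining is also why reduceByKey is generally preferable to groupByKey followed by mapValues(max) for this problem.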
