Get the max value for each key in a Spark RDD

Asked 2019-01-18 14:14

What is the best way to return the max row (value) associated with each unique key in a Spark RDD?

I'm using Python, and I've tried Python's max, mapping and reducing by key, and aggregates. Is there an efficient way to do this? Possibly a UDF?

I have an RDD in this format:

[('v', 3),
 ('v', 1),
 ('v', 1),
 ('w', 7),
 ('w', 1),
 ('x', 3),
 ('y', 1),
 ('y', 1),
 ('y', 2),
 ('y', 3)]

And I need to return:

[('v', 3),
 ('w', 7),
 ('x', 3),
 ('y', 3)]

For ties, returning either the first value or a random one is fine.

1 Answer

Answered 2019-01-18 15:06

What you have is actually a pair RDD, and one of the best ways to do this is with reduceByKey:

(Scala)

val grouped = rdd.reduceByKey(math.max(_, _))

(Python)

grouped = rdd.reduceByKey(max)
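
For completeness, here is a minimal end-to-end PySpark sketch against the question's sample data; the local SparkContext setup and variable names are assumptions for illustration:

from pyspark import SparkContext

sc = SparkContext("local", "max-per-key")  # assumed local setup, for illustration only

# The question's sample data, with string keys
rdd = sc.parallelize([
    ("v", 3), ("v", 1), ("v", 1),
    ("w", 7), ("w", 1),
    ("x", 3),
    ("y", 1), ("y", 1), ("y", 2), ("y", 3),
])

# reduceByKey folds the values of each key pairwise with the built-in max
result = rdd.reduceByKey(max).collect()
print(sorted(result))  # [('v', 3), ('w', 7), ('x', 3), ('y', 3)]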

(Java 7)

// assumes rdd is already a JavaPairRDD<String, Integer>
JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
        }
    });

(Java 8)

JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);
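
Since the question asks for the max "row" and allows ties to resolve arbitrarily, the same pattern extends to values that are full records rather than plain numbers: reduce with a comparator that keeps the whole row. A minimal Python sketch, where the (score, payload) value layout is a hypothetical example:

# Hypothetical rows of (key, (score, payload)); keep the row with the highest score.
# On a tie this keeps whichever row the reduction saw first, which the question permits.
rows = sc.parallelize([
    ("v", (3, "a")), ("v", (3, "b")),
    ("w", (7, "c")), ("w", (1, "d")),
])
best = rows.reduceByKey(lambda a, b: a if a[0] >= b[0] else b)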

See the API doc for reduceByKey for more details.
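
Note that the function passed to reduceByKey must be associative and commutative, since Spark merges partial results within each partition before shuffling; max satisfies both. This map-side combining is also why reduceByKey is generally preferable to groupByKey followed by mapValues(max) for this problem.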
