How to select the top N elements from a JavaPairRDD?

Posted 2019-07-20 10:26

Question:

I have obtained a set of key/value pairs and sorted it into a new JavaPairRDD.

Now I need to select the top 5 elements from it, that is, to obtain a new JavaPairRDD containing only those 5 elements.

How would I do that?

Is there a simpler way than using flatMap, which seems like unnecessary extra work?

Thanks!

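Answer 1:

Assuming you don't care about order, you can use RDD.take(5) to get the first 5 elements of an RDD. Note that take returns a local collection on the driver (a List<Tuple2<K, V>> for a JavaPairRDD) rather than a new RDD; JavaSparkContext.parallelizePairs can turn that list back into a JavaPairRDD.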



Answer 2:

To get the top (or bottom) items (and answer the question you asked), you could use:

.takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]


Answer 3:

Syntax for getting the x smallest values (takeOrdered is backed by a priority queue):

assuming resultRdd: RDD[Double] and x: Int
resultRdd.takeOrdered(x)(Ordering[Double])

Syntax for getting the x largest values:

assuming resultRdd: RDD[Double] and x: Int
resultRdd.top(x)(Ordering[Double])

Note: top reverses the ordering and internally invokes takeOrdered:

def top(num: Int)(implicit ord: Ordering[T]): Array[T] = takeOrdered(num)(ord.reverse)
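
The relationship between top and takeOrdered can be illustrated without a Spark cluster. The following is a minimal plain-Java sketch, not Spark's actual implementation: the class name TopNSketch and both helper methods are hypothetical, and an ordinary List stands in for the RDD. It assumes a bounded priority queue that keeps only num elements at a time, and it defines top as takeOrdered with the comparator reversed, mirroring the Scala definition above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for the RDD methods; a local List replaces the RDD.
class TopNSketch {

    // Returns the num smallest elements under cmp, sorted ascending by cmp.
    public static <T> List<T> takeOrdered(List<T> data, int num, Comparator<T> cmp) {
        if (num <= 0) {
            return new ArrayList<>();
        }
        // Heap ordered by the reversed comparator, so poll() evicts the
        // largest element currently retained.
        PriorityQueue<T> heap = new PriorityQueue<>(cmp.reversed());
        for (T item : data) {
            heap.add(item);
            if (heap.size() > num) {
                heap.poll(); // keep at most num elements in memory
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp);
        return result;
    }

    // top is takeOrdered with the ordering reversed.
    public static <T> List<T> top(List<T> data, int num, Comparator<T> cmp) {
        return takeOrdered(data, num, cmp.reversed());
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(3.0, 9.5, 1.2, 7.7, 4.4, 8.1);
        Comparator<Double> natural = Comparator.naturalOrder();
        System.out.println(takeOrdered(values, 3, natural)); // [1.2, 3.0, 4.4]
        System.out.println(top(values, 3, natural));         // [9.5, 8.1, 7.7]
    }
}
```

For key/value pairs, the same idea applies with a comparator that compares the pair's key or value, depending on which one defines "top".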