Order by Value in Spark pairRDD from (Key,Value) w

Published 2019-08-14 02:40

Question:

I have created a pair RDD like this:

val b = a.map(x => (x(0), x))

Here b is of the type

org.apache.spark.rdd.RDD[(Any, org.apache.spark.sql.Row)]
  1. How can I sort the PairRDD within each key using a field from the value row?
  2. After that I want to run a function which processes all the values for each Key in isolation in the previously sorted order. Is that possible? If yes can you please give an example.
  3. Is there any consideration needed for Partitioning the Pair RDD?

Answer 1:

Answering only your first question:

val indexToSelect: Int = ??? // must point to a sortable field (one with an Ordering, or that is Ordered)
val sorted = b.sortBy(pair => pair._2.getAs[String](indexToSelect)) // assuming a String field; use the field's actual type

This takes the second element of each pair (`pair._2`, i.e. the `Row`) and sorts the RDD by the field at position `indexToSelect` within that row. Note that `Row.apply(indexToSelect)` returns `Any`, which has no `Ordering`, so you need to extract the field as its concrete type (e.g. with `getAs`). Also note this is a total sort of the whole RDD by that field, not a sort within each key.
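Questions 2 and 3 were not addressed above. A minimal sketch of one common approach, assuming the sort field is a `String` at `indexToSelect` and that each key's values fit in one executor's memory (the function name `processPerKey` is illustrative, not from the original):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Sketch only: `b` is the RDD[(Any, Row)] from the question.
// `indexToSelect` is assumed to point at a String column.
def processPerKey(b: RDD[(Any, Row)], indexToSelect: Int): RDD[(Any, Seq[Row])] = {
  b.groupByKey()                // gather all values for each key onto one executor
    .mapValues { rows =>
      // sort this key's values in isolation by the chosen field
      val ordered = rows.toSeq.sortBy(_.getAs[String](indexToSelect))
      // run your per-key function over `ordered` here, e.g. ordered.map(...)
      ordered
    }
}
```

On question 3: `groupByKey` shuffles all of a key's values to a single partition, so this sketch assumes no key's group is too large for memory. For large groups, `repartitionAndSortWithinPartitions` with a partitioner on the key (e.g. a `HashPartitioner`) is the usual alternative: it guarantees each key's data lands in one partition and arrives sorted, without materializing whole groups.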