I have an RDD of (key, value) pairs that I transformed into an RDD of (key, List(value1, value2, value3)) as follows.
val rddInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10)))
val rddReduced = rddInit.groupByKey.mapValues(_.toList)
rddReduced.take(3).foreach(println)
This code gives me the following RDD: (1,List(2, 3)) (2,List(5, 7)) (3,List(10))
But now I would like to go back from the RDD I just computed (rddReduced) to the original rddInit.
My first guess is to perform some kind of cross product between the key and each element of the List, like this:
import scala.collection.mutable.ListBuffer

rddReduced.map {
  case (x, y) =>
    val myList: ListBuffer[(Int, Int)] = ListBuffer()
    for (element <- y) {
      myList += ((x, element)) // plain tuple instead of the deprecated Pair
    }
    myList.toList
}.flatMap(x => x).take(5).foreach(println)
With this code, I get the initial RDD as a result. But I don't think using a ListBuffer inside a Spark job is good practice. Is there any other way to solve this problem?
According to your question, I think this is what you want to do. You get a list after the group by, and you can map over it again to rebuild the pairs.
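A minimal sketch of that idea, assuming the rddReduced from the question:

// Re-expand each (key, list) pair into one (key, value) pair per element
val rddRestored = rddReduced.flatMap {
  case (k, values) => values.map(v => (k, v))
}
rddRestored.take(5).foreach(println)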
Here's one way to restore the grouped RDD back to original:
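One possible sketch uses flatMapValues, which keeps each key and flattens whatever the function produces from the value (here, the list itself):

// flatMapValues pairs the key with every element of its list
val rddBack = rddReduced.flatMapValues(identity)
rddBack.take(5).foreach(println)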
I'm surprised no one has offered a solution with Scala's for-comprehension (which gets "desugared" to flatMap and map at compile time). I don't use this syntax very often, but when I do...I find it quite entertaining. Some people prefer a for-comprehension over a series of flatMap and map, esp. for more complex transformations. After all, that's why we enjoy the flexibility of Scala, isn't it?
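As a sketch, the for-comprehension version of the same un-grouping (the compiler rewrites it into flatMap and map over the RDD):

// Desugars to rddReduced.flatMap { case (k, vs) => vs.map(v => (k, v)) }
val rddReverted = for {
  (k, vs) <- rddReduced
  v <- vs
} yield (k, v)
rddReverted.take(5).foreach(println)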
It's obviously not a good practice to use that kind of operation.
From what I have learned in a Spark Summit course, you should use DataFrames and Datasets as much as possible; using them, you will benefit from a lot of optimizations from the Spark engine. What you want to do is called explode, and it's performed by applying the explode function from the sql.functions package. The solution would be something like this:
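A sketch of that approach, assuming a SparkSession named spark is in scope (toDF and $ come from its implicits):

import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, explode}

// Same data as rddInit, as a DataFrame with columns "x" and "y"
val dfInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10))).toDF("x", "y")
// Group the values into a list column, mirroring the groupByKey step
val dfReduced = dfInit.groupBy("x").agg(collect_list("y").as("y"))
// explode turns each element of the list back into its own row
val dfResult = dfReduced.withColumn("y", explode($"y"))
dfResult.show()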
dfResult will contain the same data as dfInit.