I have a DataFrame df:
+----+----------+-------+
| id | itemName | Value |
+----+----------+-------+
| 1  | TV       | 4     |
| 1  | Movie    | 5     |
| 2  | TV       | 6     |
+----+----------+-------+
I am trying to transform it to:
{id : 1, itemMap : { "TV" : 4, "Movie" : 5}}
{id : 2, itemMap : {"TV" : 6}}
I want the final result to be an RDD[(String, String)], where the key is the id and the value is the JSON string with itemMap as the field name.
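Concretely, the final pairs should look roughly like this (the exact JSON formatting would depend on JacksonUtil):

("1", """{"itemMap":{"TV":4,"Movie":5}}""")
("2", """{"itemMap":{"TV":6}}""")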
So I am doing:
case class Data(itemMap: Map[String, Int]) extends Serializable

df.map { r =>
  // one (id, single-entry map) pair per input row
  val id = r.getAs[String]("id")
  val itemName = r.getAs[String]("itemName")
  val value = r.getAs[Int]("Value")
  (id, Map(itemName -> value))
}.reduceByKey((x, y) => x ++ y)                             // merge all maps that share the same id
 .map { case (k, v) => (k, JacksonUtil.toJson(Data(v))) }   // serialize each merged map to JSON
But it takes forever to run. Is reduceByKey efficient here, or should I use groupByKey instead? Is there any other, more efficient way to do this transformation?
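One alternative I have been considering (rough, untested sketch below, reusing the same df, Data and JacksonUtil as above) is aggregateByKey with a mutable map per key, so the merge step mutates one map instead of allocating a new immutable Map on every combine:

import scala.collection.mutable

df.map { r =>
  (r.getAs[String]("id"), (r.getAs[String]("itemName"), r.getAs[Int]("Value")))
}.aggregateByKey(mutable.Map.empty[String, Int])(
  (acc, kv) => { acc += kv; acc },   // fold one (itemName, value) into the per-key map
  (m1, m2) => { m1 ++= m2; m1 }      // merge the per-partition maps
).map { case (id, m) => (id, JacksonUtil.toJson(Data(m.toMap))) }

Would something like this be expected to behave noticeably better, or is the shuffle the dominant cost either way?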
My config: I have 10 slaves and a master of type r3.8xlarge.
spark.driver.cores 30
spark.driver.memory 200g
spark.executor.cores 16
spark.executor.instances 40
spark.executor.memory 60g
spark.memory.fraction 0.95
spark.yarn.executor.memoryOverhead 2000
Is this the correct type of machine for this task?
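For context, my rough accounting of what this config asks for (assuming an r3.8xlarge has 32 vCPUs and 244 GiB of RAM):

cores:  40 executors x 16 cores = 640 cores requested, vs. 10 slaves x 32 vCPUs = 320 available
memory: 40 executors x (60g + ~2g overhead) ≈ 2,480 GB requested, vs. 10 slaves x 244 GiB ≈ 2,440 GiB available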