Spark: groupByKey vs reduceByKey, which is better?

Posted 2019-03-04 12:11

Question:

I have a DataFrame [df]:

+------------+-----------+------+
|   id       |  itemName |Value |
+------------+-----------+------+
|   1        |  TV       |    4 |
|   1        |  Movie    |    5 |
|   2        |  TV       |    6 |
+------------+-----------+------+

I am trying to transform it to:

{id : 1, itemMap : { "TV" : 4, "Movie" : 5}}
{id : 2, itemMap : {"TV" : 6}}

I want the final result to be an RDD[(String, String)], where the value is the JSON string with itemMap as the field name.
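
For reference, the sample frame can be built like this (a minimal sketch; the SparkSession setup is illustrative only, and I am assuming Spark 2.x with the column names shown in the table):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("itemMapExample").getOrCreate()
import spark.implicits._

// Same sample data as the frame shown above; id is a String, Value an Int.
val df = Seq(
  ("1", "TV", 4),
  ("1", "Movie", 5),
  ("2", "TV", 6)
).toDF("id", "itemName", "Value")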

So I am doing:

case class Data(itemMap: Map[String, Int]) extends Serializable

df.rdd.map { r =>                         // .rdd so reduceByKey (an RDD operation) is available on Spark 2.x
  val id = r.getAs[String]("id")
  val itemName = r.getAs[String]("itemName")
  val value = r.getAs[Int]("Value")
  (id, Map(itemName -> value))
}.reduceByKey((x, y) => x ++ y)           // merge the per-record single-entry maps for each id
 .map { case (k, v) => (k, JacksonUtil.toJson(Data(v))) }

But it takes forever to run. Is reduceByKey efficient here, or should I use groupByKey instead? Is there another, more efficient way to do the transformation?
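
For comparison, here is the kind of alternative I had in mind: aggregateByKey with a mutable map, which, as I understand it, folds each record into an existing map instead of allocating a new immutable Map on every ++. This is only a sketch under the same schema (Data and JacksonUtil are the same as in my code above):

import scala.collection.mutable

// Sketch: one mutable map accumulates per key, so records are merged in place
// instead of creating a new immutable Map on every merge.
val json = df.rdd.map { r =>
  (r.getAs[String]("id"), (r.getAs[String]("itemName"), r.getAs[Int]("Value")))
}.aggregateByKey(mutable.Map.empty[String, Int])(
  (acc, kv) => { acc += kv; acc },   // fold one (itemName, Value) pair into the map
  (m1, m2) => { m1 ++= m2; m1 }      // combine partial maps from different partitions
).mapValues(m => JacksonUtil.toJson(Data(m.toMap)))

Would something like this be expected to perform better than my reduceByKey version?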

My config: I have 10 slaves and a master, all of type r3.8xlarge.

spark.driver.cores  30
spark.driver.memory 200g
spark.executor.cores    16
spark.executor.instances    40
spark.executor.memory   60g
spark.memory.fraction   0.95
spark.yarn.executor.memoryOverhead  2000

Is this the correct type of machine for this task?