Optimize/tune settings for a Spark job with union, groupByKey, and reduceGroups

Posted 2019-08-11 09:36

Question:

Hi, I am trying to see if there are any settings (executor memory, cores, shuffle partitions, or anything else we can think of) that might speed up a job which includes union, groupByKey, and reduceGroups operations.

I understand these are expensive operations to perform; the job currently takes 5 hours to finish.
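For reference, the same knobs can also be set programmatically when building the session. A minimal sketch (the values mirror the spark-submit command below and are illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: the tuning knobs named above, set on the session builder.
    // Values copied from the spark-submit step below, purely for illustration.
    val spark = SparkSession.builder()
      .appName("Links")
      .config("spark.executor.memory", "32g")
      .config("spark.executor.cores", "4")
      .config("spark.sql.shuffle.partitions", "2020")
      .getOrCreate()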

example:

.union(transitive)
  .union(family)
  .groupByKey(_.key)
  .reduceGroups((left, right) => ...) // full merge body shown below under "The function"

The spark-submit step

"Step5_Spark_Command": "command-runner.jar,spark-submit,--class,com.ms.eng.link.modules.linkmod.Links,--name,\\\"Links\\\",--master,yarn,--deploy-mode,client,--executor-memory,32G,--executor-cores,4,--conf,spark.sql.shuffle.partitions=2020,/home/hadoop/linking.jar,jobId=#{myJobId},environment=prod",

The function

val family =
      generateFamilyLinks(references, superNodes.filter(_.linkType == FAMILY))
        .checkpoint(eager = true)
    direct
      .union(reciprocal)
      .union(transitive)
      .union(family)
      .groupByKey(_.key)
      .reduceGroups((left, right) =>
        left.copy(
          contributingReferences = left.contributingReferences ++ right.contributingReferences,
          linkTypes = left.linkTypes ++ right.linkTypes,
          contexts = left.contexts ++ right.contexts
        )
      )
      .map(group =>
        group._2.copy(
          contributingReferences = ArrayUtil.dedup(group._2.contributingReferences, _.key)
        )
      )
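One detail worth noting: Dataset.checkpoint(eager = true) requires a checkpoint directory to be configured beforehand, or the job fails at that line. A minimal sketch (the HDFS path is an assumption, not from the original post):

    // Must be set before any .checkpoint call; the path here is a placeholder.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/linking-checkpoints")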

Answer 1:

Looking at your spark-submit command, I can see you're running Spark on YARN, but may I ask why in client mode? In client mode the driver runs on the machine you submit from rather than inside the cluster, so it doesn't fully leverage your cluster's resources and can become a bottleneck. Use --deploy-mode cluster instead.
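Concretely, that is a one-flag change to the step command. A sketch of the adjusted invocation (in cluster mode you may also want to set --driver-memory, since the driver now runs in a YARN container):

    spark-submit \
      --class com.ms.eng.link.modules.linkmod.Links \
      --name "Links" \
      --master yarn \
      --deploy-mode cluster \
      --executor-memory 32G \
      --executor-cores 4 \
      --conf spark.sql.shuffle.partitions=2020 \
      /home/hadoop/linking.jar jobId=#{myJobId} environment=prod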

For hardware provisioning, see this link.

Hope this helps in scaling your app.