Hi, I am trying to see if there are any settings (executor memory, cores, shuffle partitions, or anything else we can think of) that might speed up a job which includes union, groupByKey, and reduceGroups operations. I understand these are expensive operations to perform, and the job currently takes about 5 hours to finish. The spark-submit command and the relevant function are below.
The spark-submit command:
"Step5_Spark_Command": "command-runner.jar,spark-submit,--class,com.ms.eng.link.modules.linkmod.Links,--name,\\\"Links\\\",--master,yarn,--deploy-mode,client,--executor-memory,32G,--executor-cores,4,--conf,spark.sql.shuffle.partitions=2020,/home/hadoop/linking.jar,jobId=#{myJobId},environment=prod",
The function:
val family =
  generateFamilyLinks(references, superNodes.filter(_.linkType == FAMILY))
    .checkpoint(eager = true)

direct
  .union(reciprocal)
  .union(transitive)
  .union(family)
  .groupByKey(_.key)
  // merge records that share the same key
  .reduceGroups((left, right) =>
    left.copy(
      contributingReferences = left.contributingReferences ++ right.contributingReferences,
      linkTypes = left.linkTypes ++ right.linkTypes,
      contexts = left.contexts ++ right.contexts
    )
  )
  // reduceGroups yields (key, value) pairs; keep the value and dedup its references
  .map(group =>
    group._2.copy(
      contributingReferences = ArrayUtil.dedup(group._2.contributingReferences, _.key)
    )
  )
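One note on the checkpoint(eager = true) call above: Dataset.checkpoint needs a checkpoint directory configured on the SparkContext before it runs, roughly like the sketch below (the path here is purely illustrative, not our actual setting):

// assumed setup somewhere before the eager checkpoint; the directory path is illustrative
spark.sparkContext.setCheckpointDir("hdfs:///tmp/linking-checkpoints")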