Question:

What is the formula that Spark uses to calculate the number of reduce tasks?

I am running a couple of spark-sql queries, and the number of reduce tasks is always 200. The number of map tasks for these queries is 154. I am on Spark 1.4.1.

Is this related to spark.shuffle.sort.bypassMergeThreshold, which defaults to 200?

Answer 1:

It's spark.sql.shuffle.partitions that you're after. According to the Spark SQL programming guide:

Property: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

Another related option is spark.default.parallelism, which determines the 'default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user'. However, this seems to be ignored by Spark SQL and is only relevant when working with plain RDDs.
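To illustrate the plain-RDD side, here is a minimal sketch. The object name, the local[2] master, and the values 8 and 5 are assumptions chosen for illustration. It prints the partition count of a reduceByKey result when the count is left to spark.default.parallelism, and when it is passed explicitly.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone app; names and numbers are illustrative only.
object DefaultParallelismCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("default-parallelism-check")
      .setMaster("local[2]")
      // Assumed value, used when a shuffle transformation gets no explicit partition count.
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // Four input partitions of (key, 1) pairs.
    val pairs = sc.parallelize(1 to 1000, 4).map(i => (i % 10, 1))

    // No partition count given: the default partitioner consults spark.default.parallelism.
    val implicitCount = pairs.reduceByKey(_ + _)
    // Explicit partition count: overrides the default for this shuffle only.
    val explicitCount = pairs.reduceByKey(_ + _, 5)

    println(s"reduceByKey, default:  ${implicitCount.partitions.length} partitions")
    println(s"reduceByKey, explicit: ${explicitCount.partitions.length} partitions")

    sc.stop()
  }
}
```

Passing the count per call, as in the second reduceByKey, is usually the more targeted choice when only one shuffle needs tuning.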

Answer 2:

Yes, @svgd, that is the correct parameter. Here is how you reset it in Scala:

// Set number of shuffle partitions to 3
sqlContext.setConf("spark.sql.shuffle.partitions", "3")
// Verify the setting 
sqlContext.getConf("spark.sql.shuffle.partitions")
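As a rough check that the setting really drives the number of reduce tasks, the sketch below builds a small DataFrame, runs an aggregation, and prints the partition count of the result. The object name, the local[2] master, and the sample data are assumptions made for illustration. On Spark 1.x the printed count should match the configured value. Newer releases with adaptive query execution may coalesce shuffle partitions, so the number can differ there.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical standalone app written against the Spark 1.x SQLContext API.
object ShufflePartitionsCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-partitions-check")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Illustrative data: 1000 rows spread over 10 keys.
    val df = sc.parallelize(1 to 1000).map(i => (i % 10, i)).toDF("key", "value")

    // Set the number of shuffle partitions before the query is planned.
    sqlContext.setConf("spark.sql.shuffle.partitions", "3")

    // groupBy triggers a shuffle; its reduce side uses the configured partition count.
    val counts = df.groupBy("key").count()
    println(s"partitions after aggregation: ${counts.rdd.partitions.length}")

    sc.stop()
  }
}
```

The same property can also be changed from SQL with sqlContext.sql("SET spark.sql.shuffle.partitions=3").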
