Spark Coalesce More Partitions

Posted 2019-08-15 05:07

I have a spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions since each partition is written out as a file.

In other cases I may only have 50 partitions during processing. If I wanted to coalesce rather than repartition for performance reasons, what would happen?

The docs say coalesce should only be used when the number of output partitions is less than the number of input partitions, but what happens if it isn't? It doesn't seem to cause an error. Does it make the data incorrect or cause performance problems?

I am trying to avoid having to count the partitions of my RDD to determine whether I exceed my output limit and, if so, coalesce.
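For concreteness, the pattern I have in mind looks roughly like this (the output limit and bucket path are made up):

```scala
import org.apache.spark.rdd.RDD

// Illustrative cap on the number of output files; not a real Spark setting.
val maxOutputPartitions = 500

def writeCapped(rdd: RDD[String]): Unit = {
  // The check I'd like to avoid: only coalesce when we exceed the limit.
  val capped =
    if (rdd.getNumPartitions > maxOutputPartitions) rdd.coalesce(maxOutputPartitions)
    else rdd
  capped.saveAsTextFile("s3a://my-bucket/output") // placeholder path
}
```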

1 Answer
不美不萌又怎样 · answered 2019-08-15 05:10

With the default PartitionCoalescer, if the requested number of partitions is larger than the current number of partitions and you don't set shuffle to true, the number of partitions stays unchanged.
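A quick spark-shell check illustrates this; the partition counts here are just examples:

```scala
val rdd = sc.parallelize(1 to 1000, 50)   // start with 50 partitions

// Requesting more partitions without a shuffle is a no-op:
// the data stays correct and the partition count is simply unchanged.
rdd.coalesce(5000).getNumPartitions       // 50
```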

coalesce with shuffle set to true, on the other hand, is equivalent to repartition with the same value of numPartitions.
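So, continuing with the same `rdd` as above, both of these produce the requested count; in Spark's RDD source, `repartition(n)` is in fact defined as `coalesce(n, shuffle = true)`:

```scala
rdd.coalesce(5000, shuffle = true).getNumPartitions  // 5000, full shuffle
rdd.repartition(5000).getNumPartitions               // 5000, same code path
```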
