I have a Spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions, since each partition is written out as a separate file.
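For context, the write step looks roughly like this (a minimal sketch; the bucket paths, the processing step, and the target of 100 partitions are placeholders, not my real values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("write-example").getOrCreate()

val results = spark.sparkContext
  .textFile("s3a://my-bucket/input/")   // may produce 5000+ partitions
  .map(line => line.toUpperCase)        // stand-in for the real processing

results
  .coalesce(100)                        // each partition becomes one output file
  .saveAsTextFile("s3a://my-bucket/output/")
```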
In other cases I may have only 50 partitions during processing. If I use coalesce rather than repartition for performance reasons, what would happen?
The docs say coalesce should only be used when the number of output partitions is less than the input, but what happens when it isn't? It doesn't seem to raise an error. Does it make the data incorrect, or cause performance problems?
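To make the question concrete, this is the case I'm asking about (a small sketch; the numbers are arbitrary):

```scala
// Ask coalesce for MORE partitions than currently exist, without a shuffle.
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 50) // 50 partitions

val coalesced = rdd.coalesce(100) // target (100) > current (50), shuffle = false

println(coalesced.getNumPartitions) // no error is raised -- but is the result safe?
```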
I am trying to avoid having to check the number of partitions of my RDD to determine whether I have more partitions than my output limit and, if so, coalesce, as in the sketch below.
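In other words, this is the boilerplate I'm hoping to skip (`results` is the RDD from the first sketch; `maxOutputFiles` is a placeholder):

```scala
val maxOutputFiles = 100 // placeholder for my real output limit

// The conditional check I'd like to avoid before every write.
val toWrite =
  if (results.getNumPartitions > maxOutputFiles) results.coalesce(maxOutputFiles)
  else results

toWrite.saveAsTextFile("s3a://my-bucket/output/")
```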