I have a Spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions, since each partition is written out as a file.
In other cases I may have only 50 partitions during processing. If I wanted to coalesce rather than repartition for performance reasons, what would happen?
The docs say coalesce should only be used when the number of output partitions is less than the input, but what happens if it isn't? It doesn't seem to cause an error. Does it make the data incorrect or cause performance problems?
I am trying to avoid having to count the partitions of my RDD to determine whether I have more than my output limit and, if so, coalesce.
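For concreteness, this is the kind of conditional I would like to avoid (a minimal sketch; `maxOutputFiles` and the S3 path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val result = spark.sparkContext.parallelize(1 to 100000, 5000) // e.g. 5000 partitions after processing
val maxOutputFiles = 500 // hypothetical cap on the number of files written

// The check I'd rather not have to write:
val toWrite =
  if (result.getNumPartitions > maxOutputFiles) result.coalesce(maxOutputFiles)
  else result

toWrite.saveAsTextFile("s3a://my-bucket/output/") // bucket/path are illustrative
```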
With the default `PartitionCoalescer`, if the requested number of partitions is larger than the current number of partitions and you don't set `shuffle` to `true`, the number of partitions stays unchanged. `coalesce` with `shuffle` set to `true`, on the other hand, is equivalent to `repartition` with the same value of `numPartitions`.
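A quick way to verify this (a minimal sketch against the RDD API in local mode; the partition counts are just illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000, 50) // start with 50 partitions

// Asking coalesce for MORE partitions without a shuffle is a no-op:
println(rdd.coalesce(5000).getNumPartitions) // still 50

// Asking for fewer partitions works as usual:
println(rdd.coalesce(10).getNumPartitions) // 10

// With shuffle = true, coalesce can grow the partition count,
// exactly like repartition:
println(rdd.coalesce(5000, shuffle = true).getNumPartitions) // 5000
println(rdd.repartition(5000).getNumPartitions) // 5000
```

In other words, since `coalesce(n)` without a shuffle never increases the partition count, you can call it unconditionally with your output limit and skip the partition-count check entirely.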