I have a Spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions, since each partition is written out as a separate file.
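For context, the write step looks roughly like this (a minimal sketch; the bucket paths, the processing step, and the target of 100 partitions are placeholders, not my real values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("write-example").getOrCreate()

val results = spark.sparkContext
  .textFile("s3a://my-bucket/input/")   // may produce 5000+ partitions
  .map(line => line.toUpperCase)        // stand-in for the real processing

results
  .coalesce(100)                        // each partition becomes one output file
  .saveAsTextFile("s3a://my-bucket/output/")
```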
In other cases I may have only 50 partitions during processing. If I use coalesce rather than repartition for performance reasons, what would happen?
The docs say coalesce should only be used when the number of output partitions is less than the input, but what happens when it isn't? It doesn't seem to raise an error. Does it make the data incorrect, or cause performance problems?
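To make the question concrete, this is the case I'm asking about (a small sketch; the numbers are arbitrary):

```scala
// Ask coalesce for MORE partitions than currently exist, without a shuffle.
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 50) // 50 partitions

val coalesced = rdd.coalesce(100) // target (100) > current (50), shuffle = false

println(coalesced.getNumPartitions) // no error is raised -- but is the result safe?
```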
I am trying to avoid having to check the number of partitions of my RDD to determine whether I have more partitions than my output limit and, if so, coalesce, as in the sketch below.
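In other words, this is the boilerplate I'm hoping to skip (`results` is the RDD from the first sketch; `maxOutputFiles` is a placeholder):

```scala
val maxOutputFiles = 100 // placeholder for my real output limit

// The conditional check I'd like to avoid before every write.
val toWrite =
  if (results.getNumPartitions > maxOutputFiles) results.coalesce(maxOutputFiles)
  else results

toWrite.saveAsTextFile("s3a://my-bucket/output/")
```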