How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and only overwriting last week's worth of data.

Spark's default behaviour is to overwrite the whole table, even if only some partitions are going to be written.
There is a JIRA tracking this issue, and it is fixed in Spark 2.3.0: https://issues.apache.org/jira/browse/SPARK-20236
Since Spark 2.3.0 this is an option when overwriting a table. To use it, you need to set the new `spark.sql.sources.partitionOverwriteMode` setting to `dynamic`, the dataset needs to be partitioned, and the write mode needs to be `overwrite`. I also recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder. Example:
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
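A rough sketch of that pre-2.3.0 approach, assuming a Hive table `my_db.events` partitioned by `dt` (names are illustrative):

```python
# Drop the partitions that are about to be recomputed...
for d in ["2018-01-01", "2018-01-02"]:
    spark.sql(
        "ALTER TABLE my_db.events DROP IF EXISTS PARTITION (dt='{}')".format(d)
    )

# ...then append the freshly computed data for those dates.
df.write.mode("append").insertInto("my_db.events")
```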
Just FYI: for PySpark users, make sure to set `overwrite=True` in `insertInto`, otherwise the mode is silently changed to `append`. From the source code:
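The relevant logic looks roughly like this (paraphrased from the PySpark `DataFrameWriter` source; the exact body varies between Spark versions):

```python
def insertInto(self, tableName, overwrite=False):
    # If overwrite is left at its default, the write silently
    # falls back to append mode.
    self._jwrite.mode("overwrite" if overwrite else "append") \
        .insertInto(tableName)
```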
This is how to use it:
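A sketch of the call, assuming dynamic overwrite mode is enabled and the table `my_db.events` already exists (names are illustrative):

```python
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# overwrite=True replaces only the partitions present in df.
df.write.insertInto("my_db.events", overwrite=True)
```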
Or the SQL version works fine:
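For example, via `spark.sql` (the table and temp view names are illustrative, and `dt` is assumed to be the last column of the select):

```python
# Register the recomputed data, then let INSERT OVERWRITE with
# dynamic partition overwrite replace only the partitions that
# appear in the new data.
df.createOrReplaceTempView("updates")
spark.sql("""
    INSERT OVERWRITE TABLE my_db.events PARTITION (dt)
    SELECT * FROM updates
""")
```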
For documentation, see the PySpark API reference for `DataFrameWriter.insertInto`.