How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and only overwriting last week's worth of data.

Spark's default behaviour is to overwrite the whole table, even if only some partitions are going to be written.
There is a JIRA tracking this issue, and it is fixed in Spark 2.3.0: https://issues.apache.org/jira/browse/SPARK-20236
Since Spark 2.3.0 this is an option when overwriting a table. To use it, you need to set the new `spark.sql.sources.partitionOverwriteMode` setting to `dynamic`, the dataset needs to be partitioned, and the write mode needs to be `overwrite`. I also recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder. Example:
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
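A rough sketch of that pre-2.3.0 approach, assuming a Hive table `my_db.events` partitioned by `dt` (names are illustrative):

```python
# Drop the partitions that are about to be recomputed...
for d in ["2018-01-01", "2018-01-02"]:
    spark.sql(
        "ALTER TABLE my_db.events DROP IF EXISTS PARTITION (dt='{}')".format(d)
    )

# ...then append the freshly computed data for those dates.
df.write.mode("append").insertInto("my_db.events")
```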
Just FYI: for PySpark users, make sure to set `overwrite=True` in `insertInto`, otherwise the mode is silently changed to `append`. From the source code:
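The relevant logic looks roughly like this (paraphrased from the PySpark `DataFrameWriter` source; the exact body varies between Spark versions):

```python
def insertInto(self, tableName, overwrite=False):
    # If overwrite is left at its default, the write silently
    # falls back to append mode.
    self._jwrite.mode("overwrite" if overwrite else "append") \
        .insertInto(tableName)
```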
This is how to use it:
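A sketch of the call, assuming dynamic overwrite mode is enabled and the table `my_db.events` already exists (names are illustrative):

```python
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# overwrite=True replaces only the partitions present in df.
df.write.insertInto("my_db.events", overwrite=True)
```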
Or the SQL version works fine:
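For example, via `spark.sql` (the table and temp view names are illustrative, and `dt` is assumed to be the last column of the select):

```python
# Register the recomputed data, then let INSERT OVERWRITE with
# dynamic partition overwrite replace only the partitions that
# appear in the new data.
df.createOrReplaceTempView("updates")
spark.sql("""
    INSERT OVERWRITE TABLE my_db.events PARTITION (dt)
    SELECT * FROM updates
""")
```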
For documentation, see the PySpark API reference for `DataFrameWriter.insertInto`.