I want to overwrite specific partitions, rather than all of them, in Spark. I am trying the following command:
df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')
where df is a dataframe containing the incremental data to be overwritten.
hdfs-base-path contains the master data.
When I try the above command, it deletes all the partitions and inserts only those present in df at the HDFS path.
My requirement is to overwrite only the partitions present in df at the specified HDFS path. Can someone help me with this?
If you use a DataFrame, you may want to use a Hive table over the data. In that case you just need to call the appropriate write method.
It'll overwrite only the partitions that the DataFrame contains.
There's no need to specify the format (orc), because Spark will use the Hive table's format.
It works fine in Spark version 1.6.
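The exact call did not survive the formatting here; the usual pattern this answer describes is `insertInto` in overwrite mode on a Hive-backed table. A minimal sketch (the table name `mydb.mytable` is a placeholder, and the DataFrame's columns are assumed to match the table schema, partition columns last):

```python
def overwrite_matching_partitions(df, table_name):
    # Hedged sketch: with a Hive-backed table, insertInto in overwrite mode
    # replaces only the partitions present in `df` (the behavior this answer
    # reports for Spark 1.6), leaving other partitions intact.
    df.write.mode("overwrite").insertInto(table_name)

# With a live SparkSession:
# overwrite_matching_partitions(df, "mydb.mytable")
```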
I tried the approach below to overwrite a particular partition in a Hive table.
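The code for this answer was lost; the standard Hive statement for overwriting one static partition is `INSERT OVERWRITE TABLE ... PARTITION (...)`. A sketch with a hypothetical helper that builds the statement (all table, column, and value names are illustrative):

```python
def insert_overwrite_partition_sql(table, part_col, part_val, select_sql):
    # Hypothetical helper: builds a Hive statement that overwrites only the
    # named static partition, leaving all other partitions untouched.
    return (f"INSERT OVERWRITE TABLE {table} "
            f"PARTITION ({part_col}='{part_val}') {select_sql}")

# With a live SparkSession backed by Hive, this would be run as:
# spark.sql(insert_overwrite_partition_sql(
#     "mytable", "col4", "2017-01-01",
#     "SELECT col1, col2, col3 FROM updated_data"))
```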
This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,
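The example itself is missing here; the technique is to save into the single partition's directory (layout `<base>/<col>=<value>`). A sketch with a path-building helper (column and value names are assumptions, and note the partition column must be dropped from the DataFrame before writing):

```python
def partition_dir(base_path, part_col, part_val):
    # Builds the directory Spark uses for one partition value,
    # following the standard <base>/<col>=<value> layout.
    return f"{base_path.rstrip('/')}/{part_col}={part_val}"

# With a live SparkSession, overwrite just that one partition:
# df.drop("col4").write.mode("overwrite").orc(
#     partition_dir("maprfs:///hdfs-base-path", "col4", "someValue"))
```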
If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:
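The setting referred to is the Parquet summary-metadata switch; in PySpark it is reached through the JVM-side Hadoop configuration (a sketch, assuming a live SparkContext `sc`):

```python
# Assumed PySpark form of the setting this answer describes:
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")
```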
If you are using Spark prior to 1.6.2, you will also need to delete the `_SUCCESS` file in `/root/path/to/data/partition_col=value`, or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.) You can get a few more details about how to manage large partitioned tables from my Spark Summit talk on Bulletproof Jobs.
You could do something like the following to make the job reentrant (idempotent); I tried this on Spark 2.2.
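The example for this answer is missing; a common way to make such a job re-runnable is to drop the target partition before appending, so a failed run can be repeated without duplicating data. A sketch (table, column, and value names are assumptions):

```python
def drop_partition_sql(table, part_col, part_val):
    # Hypothetical helper: dropping the partition first means a re-run starts
    # from a clean slate, so the subsequent append cannot duplicate rows.
    return (f"ALTER TABLE {table} DROP IF EXISTS "
            f"PARTITION ({part_col}='{part_val}')")

# With a live SparkSession:
# spark.sql(drop_partition_sql("mytable", "col4", "2017-01-01"))
# df.write.mode("append").partitionBy("col4").orc("maprfs:///hdfs-base-path")
```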
Instead of writing to the target table directly, I would suggest you create a temporary table like the target table and insert your data there.
Once the table is created, you would write your data to the `tmpLocation`.
Then you would recover the table partition paths by executing:
Get the partition paths by querying the Hive metadata like:
Delete these partitions from the `trgtTbl` and move the directories from `tmpTbl` to `trgtTbl`.
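The SQL for the steps above did not survive; they can be sketched with the usual Hive statements (`CREATE TABLE ... LIKE`, `MSCK REPAIR TABLE` to recover partition paths, `SHOW PARTITIONS` to list them). The statements below are assumptions reconstructed from the step descriptions, with `tmpTbl`, `trgtTbl`, and `tmpLocation` taken from the answer:

```python
# Hedged sketch of the temp-table procedure described above.
statements = [
    # 1. Temporary table with the target's schema, pointed at tmpLocation
    "CREATE TABLE tmpTbl LIKE trgtTbl LOCATION 'tmpLocation'",
    # 2. Recover the partition paths written under tmpLocation
    "MSCK REPAIR TABLE tmpTbl",
    # 3. List the partitions that must be replaced in the target
    "SHOW PARTITIONS tmpTbl",
]
# With a live SparkSession backed by Hive:
# for s in statements:
#     spark.sql(s)
```

The final step (dropping the partitions from `trgtTbl` and moving the directories over) is done outside SQL, e.g. with Hadoop FileSystem operations.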
As jatin wrote, you can delete the partitions from Hive and from the path, and then append the data. Since I was wasting too much time on this, I added the following example for other Spark users. I used Scala with Spark 2.2.1.
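The original Scala example did not survive the formatting; the same steps (drop the affected partitions, then append the fresh data) can be sketched in PySpark style, with all names and values illustrative:

```python
def drop_partition_statements(table, part_col, part_vals):
    # Hypothetical helper: Hive DDL to drop each partition about to be rewritten.
    return [f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({part_col}='{v}')"
            for v in part_vals]

# With a live SparkSession one would run these, delete the corresponding
# directories (e.g. via Hadoop FileSystem APIs), and finally append:
# for s in drop_partition_statements("mytable", "col4", ["a", "b"]):
#     spark.sql(s)
# df.write.mode("append").partitionBy("col4").orc("maprfs:///hdfs-base-path")
```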