I ran a insert overwrite on a partitioned table. After the command, say for example the below partitions are created.
a,b,c,d,e
Now when I rerun the Insert overwrite table, but this time with completely different set of data. Say for example, after the 2nd insert, below partitions get created.
f,g,h,i,j
Question- Will the data from second insert not overwrite the data belonging to first insert. I still see the folders a,b,c,d,e in HDFS after the 2nd insert. But when I query them through HIVE, partitions a,b,c,d,e doesn't show up.
Is this the expected behavior of Insert overwrite table? or Am I missing anything here?
Here are some sample scripts and their effects.
Let's say we have a table SOME_TABLE
with three fields, A
, B
, and C
, and two partition fields, YEAR
, and MONTH
.
Let's assume that we want to delete the data in the table but not the structure of the table.
1. No partition specifications
If we run
INSERT OVERWRITE TABLE SOME_TABLE
PARTITION
(
YEAR
,MONTH
)
SELECT A,B,C,YEAR,MONTH
FROM SOME_TABLE
WHERE FALSE
then the query executes but the data stays there.
2. Partial partition specifications
If we run
INSERT OVERWRITE TABLE SOME_TABLE
PARTITION
(
YEAR=2018
,MONTH
)
SELECT A,B,C,MONTH
FROM SOME_TABLE
WHERE FALSE
then the same happens: the data stays there.
3. Full partition specifications
If we run
INSERT OVERWRITE TABLE SOME_TABLE
PARTITION
(
YEAR=2018
,MONTH=11
)
SELECT A,B,C
FROM SOME_TABLE
WHERE FALSE
only then will the data be erased, and only for the given year and month. The data in other partitions doesn't get deleted.
Note that no partitions have been dropped, and a SHOW PARTITIONS SOME_TABLE
returns all partitions, including those that existed before running an INSERT OVERWRITE
.
Side note:
I'd be curious to know if there is a way to delete all existing data, but keep the table structure. DROP TABLE
would drop the structure. ALTER TABLE ... DROP PARTITION
would only delete the targeted partitions, so we would have to know in advance what range of values each partition field takes. In this example, we could do
ALTER TABLE SOME_TABLE DROP PARTITION(year>=0,month>0) PURGE;
but it still seems too complex a solution.
Thats strange. I tried the above scenario but see different results. First did insert overwrite which created a,b,c,d,e partitions. Able to see them via SHOW PARTITIONS command as well and able to do select queries which return data.
Then I did another insert overwrite which created five new paritions f,g,h,i,j. When i do SHOW PARTITIONS i see all the partitions ranging from a to j. Also the data is intact in all the partitions and m able to fetch it via select statements.
As a next step i tried loading data into an existing partition say a and I could see that data overwrites existing data in the partition.
This is expected behavior.
Please share statements you used to create the table and load data into it.
Insert overwrite in hive deletes all existing data, and than write new data with the partition you created before(when you created your table). When you create partitions, they is added to hive metadata, but they stay there until you drop partitions or table.Thus, when you ``overwrite` a table, those partitions still apply to new data.