HIVE Insert overwrite into a partitioned Table

2019-08-02 12:14发布

问题:

I ran a insert overwrite on a partitioned table. After the command, say for example the below partitions are created. a,b,c,d,e

Now when I rerun the Insert overwrite table, but this time with completely different set of data. Say for example, after the 2nd insert, below partitions get created. f,g,h,i,j

Question- Will the data from second insert not overwrite the data belonging to first insert. I still see the folders a,b,c,d,e in HDFS after the 2nd insert. But when I query them through HIVE, partitions a,b,c,d,e doesn't show up.

Is this the expected behavior of Insert overwrite table? or Am I missing anything here?

回答1:

Here are some sample scripts and their effects.

Let's say we have a table SOME_TABLE with three fields, A, B, and C, and two partition fields, YEAR, and MONTH. Let's assume that we want to delete the data in the table but not the structure of the table.

1. No partition specifications

If we run

INSERT OVERWRITE TABLE SOME_TABLE
PARTITION
(
     YEAR
    ,MONTH
)
SELECT A,B,C,YEAR,MONTH
FROM SOME_TABLE
WHERE FALSE

then the query executes but the data stays there.

2. Partial partition specifications

If we run

INSERT OVERWRITE TABLE SOME_TABLE
PARTITION
(
     YEAR=2018
    ,MONTH
)
SELECT A,B,C,MONTH
FROM SOME_TABLE
WHERE FALSE

then the same happens: the data stays there.

3. Full partition specifications

If we run

INSERT OVERWRITE TABLE SOME_TABLE
PARTITION
(
     YEAR=2018
    ,MONTH=11
)
SELECT A,B,C
FROM SOME_TABLE
WHERE FALSE

only then will the data be erased, and only for the given year and month. The data in other partitions doesn't get deleted.

Note that no partitions have been dropped, and a SHOW PARTITIONS SOME_TABLE returns all partitions, including those that existed before running an INSERT OVERWRITE.

Side note: I'd be curious to know if there is a way to delete all existing data, but keep the table structure. DROP TABLE would drop the structure. ALTER TABLE ... DROP PARTITION would only delete the targeted partitions, so we would have to know in advance what range of values each partition field takes. In this example, we could do

ALTER TABLE SOME_TABLE DROP PARTITION(year>=0,month>0) PURGE;

but it still seems too complex a solution.



回答2:

Thats strange. I tried the above scenario but see different results. First did insert overwrite which created a,b,c,d,e partitions. Able to see them via SHOW PARTITIONS command as well and able to do select queries which return data.

Then I did another insert overwrite which created five new paritions f,g,h,i,j. When i do SHOW PARTITIONS i see all the partitions ranging from a to j. Also the data is intact in all the partitions and m able to fetch it via select statements.

As a next step i tried loading data into an existing partition say a and I could see that data overwrites existing data in the partition. This is expected behavior.

Please share statements you used to create the table and load data into it.



回答3:

Insert overwrite in hive deletes all existing data, and than write new data with the partition you created before(when you created your table). When you create partitions, they is added to hive metadata, but they stay there until you drop partitions or table.Thus, when you ``overwrite` a table, those partitions still apply to new data.



标签: hadoop hive