How to delete duplicate records from a Hive table?

Posted 2019-03-20 12:46

Question:

I am trying to learn about deleting duplicate records from a Hive table.

My Hive table is 'dynpart', with columns Id, Name, Technology:

Id  Name  Technology
1   Abcd  Hadoop
2   Efgh  Java
3   Ijkl  MainFrames
2   Efgh  Java

We have options like DISTINCT to use in a select query, but a select query only retrieves data from the table. Could anyone tell me how to use a delete query to remove the duplicate rows from a Hive table?

I understand that deleting or updating records in Hive is neither recommended nor standard practice, but I want to learn how it is done.

Answer 1:

You can use an INSERT OVERWRITE statement to rewrite the table with only its distinct rows:

insert overwrite table dynpart select distinct * from dynpart;
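
Before and after running the overwrite, it can help to check which rows are actually duplicated. The query below is a minimal sketch, assuming dynpart is an unpartitioned table with exactly the three columns shown in the question; the alias copies is only illustrative.

```sql
-- Assumption: dynpart has exactly the columns Id, Name, Technology
-- and is not partitioned.
-- List every row that occurs more than once, with its number of copies.
SELECT id, name, technology, COUNT(*) AS copies
FROM dynpart
GROUP BY id, name, technology
HAVING COUNT(*) > 1;
```

On the sample data in the question this should list only the row (2, Efgh, Java), with a count of 2. After the INSERT OVERWRITE it should return no rows.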


Answer 2:

You can insert the distinct records into another table:

create table temp as select distinct * from dynpart;
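
Note that this leaves dynpart itself unchanged; the deduplicated rows exist only in temp. If the goal is to clean the original table, one possible way to finish the job is sketched below. It assumes dynpart is unpartitioned and that no table named temp exists yet, and it is one option rather than the only one.

```sql
-- Assumptions: dynpart is unpartitioned; temp does not exist yet.
-- 1. Stage the distinct rows in a separate table.
CREATE TABLE temp AS SELECT DISTINCT * FROM dynpart;

-- 2. Replace the contents of the original table with the staged rows.
INSERT OVERWRITE TABLE dynpart SELECT * FROM temp;

-- 3. Drop the staging table once the result has been checked.
DROP TABLE temp;
```

Staging through a second table costs an extra copy of the data, but it lets you inspect the deduplicated rows before the original table is overwritten.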


Tags: hadoop, hive