De-duplicating rows in a table with respect to cer

2019-08-11 11:17发布

I need to create a temporary table in HIVE using an existing table that has 7 columns. I just want to get rid of duplicates with respect to first three columns and also retain the corresponding values in the other 4 columns. I don't care which row is actually dropped while de-duplicating using first three rows alone.

标签： hive

1条回答

Rolldiameter

2楼-- · 2019-08-11 11:48

You could use something as below if you are not considered about ordering

create table table2 as 
select col1, col2, col3, 
      ,split(agg_col,"|")[0] as col4
      ,split(agg_col,"|")[1] as col5
      ,split(agg_col,"|")[2] as col6
      ,split(agg_col,"|")[3] as col7
from (Select col1, col2, col3,
             max(concat(cast(col4 as string),"|", 
                        cast(col5 as string),"|",
                        cast(col6 as string),"|",
                        cast(col7 as string))) as agg_col
from table1
group by col1,col2,col3 ) A;

Below is another approach, which gives much control over ordering but slower than above approach

create table table2 as 
select col1, col2, col3,max(col4), max(col5), max(col6), max(col7)
from (Select col1, col2, col3,col4, col5, col6, col7,
             rank() over ( partition by col1, col2, col3 
                           order by col4 desc, col5 desc, col6 desc, col7 desc ) as col_rank
from table1 ) A
where A.col_rank = 1
GROUP BY col1, col2, col3;

rank() over(..) function returns more than one column with rank as '1' if order by columns are all equal. In our case if there are 2 columns with exact same values for all seven columns then there will be duplicates when we use filter as col_rank =1. These duplicates can be eleminated using max and group by clauses as written in above query.

0人赞添加讨论(0) 举报

De-duplicating rows in a table with respect to cer

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间