Hive Data selecting latest value based on timestam

2020-02-29 11:27发布

I have a table having the following columns.

C1,C2,Process TimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23 
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31 
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31 
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20

Now, I want to fetch unique row for each primary key based on latest Process TimeStamp.

If Process TimeStamp is same then row having latest InsertDateTimeStamp should be chosen.

So, my result should be.

p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31

How to achieve this via HiveQL ?

I am currently using hive 0.10. I can not use subquery with IN or EXISTS.

Thanks.

标签: hive hiveql
2条回答
在下西门庆
2楼-- · 2020-02-29 11:51
select C1, s.C2, s.ProcessTimeStamp, s.InsertDateTimeStamp from (
  select C1, max(named_struct('unixtime', unix_timestamp(ProcessTimeStamp, 'yyyy-MM-dd HH:mm:ss'), 'C2', C2, 'ProcessTimeStamp', ProcessTimeStamp, 'InsertDateTimeStamp', InsertDateTimeStamp)) as s
  from my_table group by C1
) t;

Doing the max of a struct compares by the first field, then the second field, etc. So if you struct everything together, with the parsed timestamp value first, you get a struct representing that row. Then just un-struct it by selecting out the individual fields.

查看更多
别忘想泡老子
3楼-- · 2020-02-29 11:59

You should strongly consider upgrading Hive, this can be easily done with a window function included in Hive 11+ using row_number(partition by c1 order by ProcessTimeStamp desc) in a sub-select an selecting the first row in an outer select.

You don't need to update your entire cluster to upgrade Hive, you can just deploy it to one node.

查看更多
登录 后发表回答