Hive Data selecting latest value based on timestam

2020-02-29 11:27发布

I have a table having the following columns.

C1,C2,Process TimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23 
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31 
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31 
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20

Now, I want to fetch unique row for each primary key based on latest Process TimeStamp.

If Process TimeStamp is same then row having latest InsertDateTimeStamp should be chosen.

So, my result should be.

p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31

How to achieve this via HiveQL ?

I am currently using hive 0.10. I can not use subquery with IN or EXISTS.

Thanks.

标签： hive hiveql

2条回答

在下西门庆

2楼-- · 2020-02-29 11:51

select C1, s.C2, s.ProcessTimeStamp, s.InsertDateTimeStamp from (
  select C1, max(named_struct('unixtime', unix_timestamp(ProcessTimeStamp, 'yyyy-MM-dd HH:mm:ss'), 'C2', C2, 'ProcessTimeStamp', ProcessTimeStamp, 'InsertDateTimeStamp', InsertDateTimeStamp)) as s
  from my_table group by C1
) t;

Doing the max of a struct compares by the first field, then the second field, etc. So if you struct everything together, with the parsed timestamp value first, you get a struct representing that row. Then just un-struct it by selecting out the individual fields.

0人赞添加讨论(0) 举报

别忘想泡老子

3楼-- · 2020-02-29 11:59

You should strongly consider upgrading Hive, this can be easily done with a window function included in Hive 11+ using row_number(partition by c1 order by ProcessTimeStamp desc) in a sub-select an selecting the first row in an outer select.

You don't need to update your entire cluster to upgrade Hive, you can just deploy it to one node.

0人赞添加讨论(0) 举报

Hive Data selecting latest value based on timestam

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间