Hive Data selecting latest value based on timestam

2020-02-29 11:12发布

问题:

I have a table having the following columns.

C1,C2,Process TimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23 
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31 
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31 
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20

Now, I want to fetch unique row for each primary key based on latest Process TimeStamp.

If Process TimeStamp is same then row having latest InsertDateTimeStamp should be chosen.

So, my result should be.

p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31

How to achieve this via HiveQL ?

I am currently using hive 0.10. I can not use subquery with IN or EXISTS.

Thanks.

回答1:

select C1, s.C2, s.ProcessTimeStamp, s.InsertDateTimeStamp from (
  select C1, max(named_struct('unixtime', unix_timestamp(ProcessTimeStamp, 'yyyy-MM-dd HH:mm:ss'), 'C2', C2, 'ProcessTimeStamp', ProcessTimeStamp, 'InsertDateTimeStamp', InsertDateTimeStamp)) as s
  from my_table group by C1
) t;

Doing the max of a struct compares by the first field, then the second field, etc. So if you struct everything together, with the parsed timestamp value first, you get a struct representing that row. Then just un-struct it by selecting out the individual fields.



回答2:

You should strongly consider upgrading Hive, this can be easily done with a window function included in Hive 11+ using row_number(partition by c1 order by ProcessTimeStamp desc) in a sub-select an selecting the first row in an outer select.

You don't need to update your entire cluster to upgrade Hive, you can just deploy it to one node.



标签: hive hiveql