I have a table having the following columns.
C1,C2,Process TimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20
Now, I want to fetch unique row for each primary key based on latest Process TimeStamp
.
If Process TimeStamp
is same then row having latest InsertDateTimeStamp
should be chosen.
So, my result should be.
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31
How to achieve this via HiveQL ?
I am currently using hive 0.10. I can not use subquery with IN or EXISTS.
Thanks.
Doing the max of a struct compares by the first field, then the second field, etc. So if you struct everything together, with the parsed timestamp value first, you get a struct representing that row. Then just un-struct it by selecting out the individual fields.
You should strongly consider upgrading Hive, this can be easily done with a window function included in Hive 11+ using row_number(partition by c1 order by ProcessTimeStamp desc) in a sub-select an selecting the first row in an outer select.
You don't need to update your entire cluster to upgrade Hive, you can just deploy it to one node.