the table is:
create table test (
id string,
name string,
age string,
modified string)
data like this:
id name age modifed
1 a 10 2011-11-11 11:11:11
1 a 11 2012-11-11 12:00:00
2 b 20 2012-12-10 10:11:12
2 b 20 2012-12-10 10:11:12
2 b 20 2012-12-12 10:11:12
2 b 20 2012-12-15 10:11:12
I want to get the latest record(include every colums id,name,age,modifed) group by id,as the data above,the correct result is:
1 a 11 2012-11-11 12:00:00
2 b 20 2012-12-15 10:11:12
I do like this:
insert overwrite table t
select b.id, b.name, b.age, b.modified
from (
select id,max(modified) as modified
from test
group by id
) a
left outer join test b on (a.id=b.id and a.modified=b.modified);
This sql can get the right result,but when mass data,it runs slow.
**Is there any way to do this without left outer join? **
try this
You can get the required result without using left outer join like this:
select * from test where (id, modified) in(select id, max(modified) from test group by id)
http://sqlfiddle.com/#!2/bfbd5/42
There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example if you have a table like:
You can do this:
and get the result:
I think in case of ties on val (the first struct element) it will fall back to comparison on the second column. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?
There is a relatively recent feature of Hive SQL, analytic functions and the over clause. This should do the job without joins
What's going on here is that the subquery produces a new row with an extra column last_modified which has the latest modified timestamp for the corresponding person's id. (Similar to what group by would do) The key here is that the subquery gets you again one row per row in your original table and then you filter from that.
There is a chance that even the simpler solution works:
By the way, the same code would work in Impala, too.
If u can make sure that row which has max modified also has max age in same id row set.
Try
Presume the data is like this:
then the result of the above query will give you - (notice the repeated 2, b having the same date time)
This query runs an additional group by and is less efficient but gives the correct result -
then the result of the above query will give you
Now if we improve the query a little - then in place of 3 MRs it runs only one Keping the result same -
Note - this will slow down in case your group by field produces large results.