hive sql find the latest record

2019-03-08 19:48发布

the table is:

create table test (
id string,
name string,
age string,
modified string)

data like this:

id    name   age  modifed
1     a      10   2011-11-11 11:11:11
1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-12 10:11:12
2     b      20   2012-12-15 10:11:12

I want to get the latest record(include every colums id,name,age,modifed) group by id,as the data above,the correct result is:

1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-15 10:11:12

I do like this:

insert overwrite table t 
select b.id, b.name, b.age, b.modified 
from (
        select id,max(modified) as modified 
        from test 
        group by id
) a 
left outer join test b on (a.id=b.id  and a.modified=b.modified);

This sql can get the right result,but when mass data,it runs slow.

**Is there any way to do this without left outer join? **

8条回答
2楼-- · 2019-03-08 20:32

Give this a try:

select t1.* from test t1
join (
  select id, max(modifed) maxModified from test
  group by id
) s
on t1.id = s.id and t1.modifed = s.maxModified

Fiddle here.

Left outer join solution here.

Let us know which one runs faster :)

查看更多
啃猪蹄的小仙女
3楼-- · 2019-03-08 20:35

Just slightly different approach than what has been answered in back previous answer.

Below example uses hive windowing function to find out the latest record, read more here

SELECT t.id
    ,t.name
    ,t.age
    ,t.modified
FROM (
    SELECT id
        ,name
        ,age
        ,modified
        ,ROW_NUMBER() OVER (
            PARTITION BY id ORDER BY unix_timestamp(modified,'yyyy-MM-dd hh:mm:ss') DESC
            ) AS ROW_NUMBER   
    FROM test
    ) t
WHERE t.ROW_NUMBER <= 1;

The modified is string so converting it to timestamp using unix_timestamp(modified,'yyyy-MM-dd hh:mm:ss') then applying order by on timestamp.

查看更多
登录 后发表回答