hadoop hive count concurrency

Posted 2019-08-28 10:59

How can I implement this in Hadoop?

In Hive, I have a table with many columns, two of which are begin_time and end_time.

I need to count the number of concurrent visitors at each point in time.

A sample of the table:

begin_time                  end_time
2011.04.26 10:19:06^A2011.04.26 10:20:22
2011.04.26 10:19:08^A2011.04.26 10:21:49
2011.04.26 10:19:08^A2011.04.26 11:18:46
2011.04.26 10:19:09^A2011.04.26 12:08:36
2011.04.26 10:19:09^A2011.04.26 11:00:16
2011.04.26 10:19:11^A2011.04.26 10:19:17
2011.04.26 10:19:12^A2011.04.26 10:46:21
2011.04.26 10:19:13^A2011.04.26 10:55:43
2011.04.26 10:19:17^A2011.04.26 10:19:41
2011.04.26 10:19:18^A2011.04.26 10:34:41

The result I want is how many people are present at a specific time.

For example, at 2011.04.26 10:19:08 there are 3 visitors: one who arrived at 10:19:06 and two who arrived at 10:19:08.

At 2011.04.26 10:19:18 the count is 9: ten visitors have arrived by then, but one left at 2011.04.26 10:19:17.

The desired result for this sample is:

2011.04.26 10:19:06 1
2011.04.26 10:19:08 3
2011.04.26 10:19:09 5
2011.04.26 10:19:11 6
2011.04.26 10:19:12 7
2011.04.26 10:19:13 8
2011.04.26 10:19:17 9
2011.04.26 10:19:18 9

Any help is much appreciated and welcome.

Tags: java hadoop hive
1 answer
甜甜的少女心
Reply #2 · 2019-08-28 11:46

You can try this in Hive (assuming the table name is test_log):

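-- Assumes begin_time and end_time are strings in the fixed-width
-- yyyy.MM.dd HH:mm:ss format, so string comparison matches time order.
select /*+ MAPJOIN(driven) */ driven.time, count(*)
from
    -- every distinct begin or end timestamp is a point to count at
    (select time
     from
     (select begin_time time from test_log union all
      select end_time time from test_log) u
     group by time) driven
-- pair every point with every row, then keep the rows whose interval covers it
join test_log l on true
where
    driven.time between l.begin_time and l.end_time
group by driven.time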

This is probably not the most efficient solution, because it pairs every distinct timestamp with every row before filtering, but it works. You can add a filter to the driven subquery to reduce the data set.
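
To sanity-check the counting logic outside Hive, here is a minimal Python sketch that applies the same inclusive "between" test to the ten sample rows from the question. The names ROWS and concurrent_counts are made up for this illustration, and unlike the query above it probes only the begin_time values, since those are the timestamps listed in the desired output.

```python
# Sample rows from the question: (begin_time, end_time).
# The fixed-width "yyyy.MM.dd HH:mm:ss" strings sort in time order,
# so plain string comparison is enough here.
ROWS = [
    ("2011.04.26 10:19:06", "2011.04.26 10:20:22"),
    ("2011.04.26 10:19:08", "2011.04.26 10:21:49"),
    ("2011.04.26 10:19:08", "2011.04.26 11:18:46"),
    ("2011.04.26 10:19:09", "2011.04.26 12:08:36"),
    ("2011.04.26 10:19:09", "2011.04.26 11:00:16"),
    ("2011.04.26 10:19:11", "2011.04.26 10:19:17"),
    ("2011.04.26 10:19:12", "2011.04.26 10:46:21"),
    ("2011.04.26 10:19:13", "2011.04.26 10:55:43"),
    ("2011.04.26 10:19:17", "2011.04.26 10:19:41"),
    ("2011.04.26 10:19:18", "2011.04.26 10:34:41"),
]


def concurrent_counts(rows):
    """For each distinct begin_time, count the rows whose interval covers it.

    Both ends are inclusive, mirroring the SQL BETWEEN predicate.
    """
    probe_times = sorted({begin for begin, _ in rows})
    return {
        t: sum(1 for begin, end in rows if begin <= t <= end)
        for t in probe_times
    }


counts = concurrent_counts(ROWS)
for t, n in counts.items():
    print(t, n)
```

Like the join, this compares every probe time with every row, so it is only meant for checking small samples, not as a replacement for the query.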
