hadoop hive count concurrency

Posted 2019-08-28 11:10

Question:

How can I implement this in Hadoop?

In Hive, I have a table with many columns, two of which are begin_time and end_time.

I need to count how many visitors are present at each point in time.

A sample of the table:

begin_time                  end_time
2011.04.26 10:19:06^A2011.04.26 10:20:22
2011.04.26 10:19:08^A2011.04.26 10:21:49
2011.04.26 10:19:08^A2011.04.26 11:18:46
2011.04.26 10:19:09^A2011.04.26 12:08:36
2011.04.26 10:19:09^A2011.04.26 11:00:16
2011.04.26 10:19:11^A2011.04.26 10:19:17
2011.04.26 10:19:12^A2011.04.26 10:46:21
2011.04.26 10:19:13^A2011.04.26 10:55:43
2011.04.26 10:19:17^A2011.04.26 10:19:41
2011.04.26 10:19:18^A2011.04.26 10:34:41

The result I want is the number of people present at a specific time.

For example, at 2011.04.26 10:19:08 there are 3 visitors, because one arrived at 10:19:06 and two at 10:19:08.

At 2011.04.26 10:19:18 the count is 9, because ten have arrived but one left at 2011.04.26 10:19:17.

The desired result for this sample is:

2011.04.26 10:19:06 1
2011.04.26 10:19:08 3
2011.04.26 10:19:09 5
2011.04.26 10:19:11 6
2011.04.26 10:19:12 7
2011.04.26 10:19:13 8
2011.04.26 10:19:17 9
2011.04.26 10:19:18 9
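
The counting rule can be checked against the sample with a short script. This is a minimal Python sketch, not Hive code. The names ROWS and concurrent_at_begin_times are made up for illustration. It assumes a visitor counts as present from begin_time through end_time inclusive, which is what the expected value of 9 at 10:19:17 implies.

```python
# Sample rows from the question as (begin_time, end_time) pairs.
# The fixed-width "YYYY.MM.DD HH:MM:SS" format compares correctly as plain text.
ROWS = [
    ("2011.04.26 10:19:06", "2011.04.26 10:20:22"),
    ("2011.04.26 10:19:08", "2011.04.26 10:21:49"),
    ("2011.04.26 10:19:08", "2011.04.26 11:18:46"),
    ("2011.04.26 10:19:09", "2011.04.26 12:08:36"),
    ("2011.04.26 10:19:09", "2011.04.26 11:00:16"),
    ("2011.04.26 10:19:11", "2011.04.26 10:19:17"),
    ("2011.04.26 10:19:12", "2011.04.26 10:46:21"),
    ("2011.04.26 10:19:13", "2011.04.26 10:55:43"),
    ("2011.04.26 10:19:17", "2011.04.26 10:19:41"),
    ("2011.04.26 10:19:18", "2011.04.26 10:34:41"),
]


def concurrent_at_begin_times(rows):
    """For each distinct begin_time t, count rows with begin_time <= t <= end_time."""
    result = []
    for t in sorted({begin for begin, _ in rows}):
        # Assumption: both interval ends are inclusive.
        count = sum(1 for begin, end in rows if begin <= t <= end)
        result.append((t, count))
    return result


if __name__ == "__main__":
    for t, count in concurrent_at_begin_times(ROWS):
        print(t, count)
```

Run on the ten sample rows, this yields the eight lines listed in the desired result.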

Any help is much appreciated and welcome.

Answer 1:

You can try this in Hive (assuming the table name is test_log):

select /*+ MAPJOIN(driven) */ driven.time, count(*)    
from         
    (select time 
     from 
     (select begin_time time from test_log union all 
      select end_time time from test_log) u  
     group by time) driven
join test_log l on true
where
    driven.time between l.begin_time and l.end_time
group by driven.time

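This is probably not the most efficient solution, but it works. The join on true pairs every distinct timestamp with every row of test_log before the where clause filters them, so the cost grows with the product of the two. Adding a filter (for example, on a date range) to the driven subquery reduces the data set. Note that the driven subquery takes timestamps from both begin_time and end_time, so the result also contains a row for each end time.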
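
For comparison, the same counts can be computed without pairing every timestamp with every row. The following is a minimal Python sketch of a different technique (sorting the begin and end times separately and using binary search), not a translation of the Hive query above. The function name concurrency_by_event_time and the four-row sample are chosen for illustration. Like the query, it treats both ends of the interval as inclusive and reports every distinct begin or end time.

```python
from bisect import bisect_left, bisect_right


def concurrency_by_event_time(rows):
    """Map each distinct begin/end time t to the number of rows covering t.

    rows: iterable of (begin_time, end_time) strings in a fixed-width,
    lexicographically sortable format such as "YYYY.MM.DD HH:MM:SS".
    """
    begins = sorted(begin for begin, _ in rows)
    ends = sorted(end for _, end in rows)
    result = {}
    for t in sorted(set(begins) | set(ends)):
        started = bisect_right(begins, t)  # rows with begin_time <= t
        finished = bisect_left(ends, t)    # rows with end_time < t
        result[t] = started - finished
    return result


if __name__ == "__main__":
    # Four rows taken from the question's sample, for illustration only.
    sample = [
        ("2011.04.26 10:19:06", "2011.04.26 10:20:22"),
        ("2011.04.26 10:19:11", "2011.04.26 10:19:17"),
        ("2011.04.26 10:19:17", "2011.04.26 10:19:41"),
        ("2011.04.26 10:19:18", "2011.04.26 10:34:41"),
    ]
    for t, count in concurrency_by_event_time(sample).items():
        print(t, count)
```

Each lookup is a binary search, so the work is roughly proportional to n log n for n rows, rather than to the number of timestamps multiplied by the number of rows.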