How to retrieve trips from historical data?

2020-03-24 06:21发布

问题:

I have the following table mytable in Hive:

id    radar_id     car_id     datetime
1     A21          123        2017-03-08 17:31:19.0
2     A21          555        2017-03-08 17:32:00.0
3     A21          777        2017-03-08 17:33:00.0
4     B15          123        2017-03-08 17:35:22.0
5     B15          555        2017-03-08 17:34:05.0
5     B15          777        2017-03-08 20:50:12.0
6     A21          123        2017-03-09 11:00:00.0
7     C11          123        2017-03-09 11:10:00.0
8     A21          123        2017-03-09 11:12:00.0
9     A21          555        2017-03-09 11:12:10.0
10    B15          123        2017-03-09 11:14:00.0
11    C11          555        2017-03-09 11:20:00.0

I want to get the routes of cars passing through radars A21 and B15 within the same trip. For example, if the date is different for the same car_id, then it is not the same trip. Basically, I want to consider that the maximum time difference between radars A21 and B15 for the same vehicle should be 30 minutes. If it's bigger, then the trip is not the same, like for example for the car_id 777.

My final goal is to count the average number of trips per day (non-unique, so if the same car passed 2 times by the same route, then it should be calculated 2 times).

The expected result is the following one:

radar_start   radar_end       avg_tripscount_per_day
A21           B15             1.5

On the date 2017-03-08 there are 2 trips between radars A21 and B15 (car 777 is not considered due to 30 minutes limit), while on the date 2017-03-09 there is only 1 trip. The average is 2+1=1.5 trips per day.

How can I get this result? Basically, I do not know how to introduce 30 minutes limit in the query and how to group rides by radar_start and radar_end.

Thanks.

Update:

  1. The trip is registered at the date it started.
  2. If the car was triggered by radar A21 at 2017-03-08 23:55 and by radar B15 at 2017-03-09 00:15, then it should be considered as the same trip registered for the date 2017-03-08.
  3. In case of ids 6 and 8 the same car 123 passed by A21 two times, and then it turned to B15 (id 10). The last ride with id 8 should be considered. So, 8-10. Thus, the closest previous to B15. The interpretation is that a car passed by A21 two times and the second time is turned to B15.

回答1:

select  count(*) / count(distinct to_date(datetime))    as trips_per_day

from   (select  radar_id
               ,datetime
               ,lead(radar_id) over w  as next_radar_id
               ,lead(datetime) over w  as next_datetime                    

        from    mytable

        where   radar_id in ('A21','B15')

        window  w as 
                (
                    partition by  car_id
                    order by      datetime
                )
        ) t

where   radar_id        = 'A21'
    and next_radar_id   = 'B15'
    and datetime + interval '30' minutes >= next_datetime
;

+----------------+
| trips_per_day  |
+----------------+
| 1.5            |
+----------------+

P.s.
If your version does not support intervals, the last code record could be replaced by -
and to_unix_timestamp(datetime) + 30*60 > to_unix_timestamp(next_datetime)



回答2:

I missed that you're using Hive so started writing query for SQL-Server, but maybe it will help for you. Try something like this:

QUERY

select radar_start, 
       radar_end, 
       convert(decimal(6,3), count(*)) / convert(decimal(6,3), count(distinct dt)) as avg_tripscount_per_day
from (
    select 
        t1.radar_id as radar_start,
        t2.radar_id as radar_end,
        convert(date, t1.[datetime]) dt,
        row_number() over (partition by t1.radar_id, t1.car_id, convert(date, t1.[datetime]) order by t1.[datetime] desc) rn1,
        row_number() over (partition by t2.radar_id, t2.car_id, convert(date, t2.[datetime]) order by t2.[datetime] desc) rn2
    from trips as t1
    join trips as t2 on t1.car_id = t2.car_id 
        and datediff(minute,t1.[datetime], t2.[datetime]) between 0 and 30
        and t1.radar_id = 'A21' 
        and t2.radar_id = 'B15'
)x
where rn1 = 1 and rn2 = 1
group by radar_start, radar_end

OUPUT

radar_start radar_end   avg_tripscount_per_day
A21         B15         1.5000000000

SAMPLE DATA

create table trips
(
    id int,
    radar_id char(3),
    car_id int,
    [datetime] datetime
)
insert into trips values
(1,'A21',123,'2017-03-08 17:31:19.0'),
(2,'A21',555,'2017-03-08 17:32:00.0'),
(3,'A21',777,'2017-03-08 17:33:00.0'),
(4,'B15',123,'2017-03-08 17:35:22.0'),
(5,'B15',555,'2017-03-08 17:34:05.0'),
(5,'B15',777,'2017-03-08 20:50:12.0'),

(6,'A21',123,'2017-03-09 11:00:00.0'),
(7,'C11',123,'2017-03-09 11:10:00.0'),
(8,'A21',123,'2017-03-09 11:12:00.0'),

(9,'A21',555,'2017-03-09 11:12:10.0'),
(8,'B15',123,'2017-03-09 11:14:00.0'),
(9,'C11',555,'2017-03-09 11:20:00.0')


标签: sql hive hiveql