Grouping rows considering “difference” between row

2020-07-11 05:47发布

I've got a table which contains start-times (using number in example to keep it simple), and durations of events.

I would like to identify "blocks" and their start- and the end-time.
Whenever the difference between the end-time (start-time + duration) of the previous row (sorted by start-time) and the start-time of the current row is >=5, a new "block" should begin.

This is my test-data, including an attempt of a graphical explanation in the comments:

WITH test_data AS (
  SELECT  0 s, 2 dur FROM dual UNION ALL   --# ■■
  SELECT  2  , 2     FROM dual UNION ALL   --#   ■■
  SELECT 10  , 1     FROM dual UNION ALL   --#           ■
  SELECT 13  , 4     FROM dual UNION ALL   --#              ■■■■
  SELECT 15  , 4     FROM dual             --#                ■■■■
)
--# Should return
--#   0 ..  4                              --# ■■■■
--#  10 .. 19                              --#           ■■■■■■■■■

The first block starts at 0 and ends at 4. Since the difference to the next row is >=5, start another block at 10 which ends at 19.


I can identify the first row of a block, using LAG, but I have not yet found out how to proceed.

And I could solve the problem in a PL/SQL-loop, but I'm trying to avoid that for performance reasons.


Any suggestions on how to write this query?

Thanks in advance, Peter

4条回答
祖国的老花朵
2楼-- · 2020-07-11 05:55

There is a fantastic book by Richard Snodgrass which may help: Developing Time-Oriented Database Applications in SQL (free to download) which I have found invaluable when dealing with time in databases.

Have a look on Richards page for links to some book corrections and the associated CD-ROM in zip format.

查看更多
smile是对你的礼貌
3楼-- · 2020-07-11 05:58

The code gets a bit complicated with a number of subqueries, etc. The may be instances of data where this doesn't work but I can't think of any off the top of my head.

Working with temporal data is always a pain!

WITH test_data AS (
  SELECT  0 s, 2 dur FROM dual UNION ALL   --# ■■
  SELECT  2  , 2     FROM dual UNION ALL   --#   ■■
  SELECT 10  , 1     FROM dual UNION ALL   --#           ■
  SELECT 13  , 4     FROM dual UNION ALL   --#              ■■■■
  SELECT 15  , 4     FROM dual             --#                ■■■■
)
select 
-- Group on each block
  min(start_time) as s, 
  max(end_time) - min(start_time) as dur
from (
  select 
    start_time,
    duration, 
    end_time, 
-- number the blocks sequentially 
    sum(is_block_start) over (order by start_time) as block_num
  from (
    select 
      start_time, 
      duration, 
      end_time, 
-- Mark the start of each block
      case 
        when nvl2(prev_end_time, start_time - prev_end_time,5) >= 5 
        then 1 else 0 end as is_block_start
    from (
      select 
        s as start_time, 
        dur as duration, 
        s+dur as end_time,
        lag(s+dur) over (order by s) prev_end_time
      from test_data
    )
  )
)
group by block_num
查看更多
霸刀☆藐视天下
4楼-- · 2020-07-11 05:59

In MS-SQL I would use ROW_NUMBER() OVER(ORDER BY starttime) AS Rank to rank the rows on start time.

Then, I would write a query to join each line to the line with previous Rank and set a flag if the difference is bigger than five or NULL (first row).

Then, I would select all rows having this flag which are start rows, and for this subset repeat the process of numbering rows and joining to the next row to get the time spans:

blockstarttime1 nextstarttime1 (=starttime2)
blockstarttime2 nextstarttime2 (=starttime3)
blockstarttime3 NULL

Finally, this dataset can be joined to the original data with a WHERE starttime BETWEEN blockstarttime and nextstarttime to partition the results.

Up to you to translate this to Oracle...

查看更多
5楼-- · 2020-07-11 06:12

I use subqueries with analytics to identify and group contiguous ranges:

SQL> WITH test_data AS (
  2    SELECT  0 s, 2 dur FROM dual UNION ALL   --# ■■
  3    SELECT  2  , 2     FROM dual UNION ALL   --#   ■■
  4    SELECT 10  , 1     FROM dual UNION ALL   --#           ■
  5    SELECT 13  , 4     FROM dual UNION ALL   --#              ■■■■
  6    SELECT 15  , 4     FROM dual             --#                ■■■■
  7  )
  8  SELECT MIN(s) "begin", MAX(s + dur) "end"
  9    FROM (SELECT s, dur, SUM(gap) over(ORDER BY s) my_group
 10             FROM (SELECT s, dur,
 11                           CASE
 12                              WHEN lag(s + dur) over(ORDER BY s) >= s - 5 THEN
 13                               0
 14                              ELSE
 15                               1
 16                           END gap
 17                      FROM test_data
 18                     ORDER BY s))
 19   GROUP BY my_group;

     begin        end
---------- ----------
         0          4
        10         19
查看更多
登录 后发表回答