Flatten/merge overlapping time intervals

2020-07-17 14:51发布

问题:

I have a 'Service' table with millions of rows. Each row corresponds to a service provided by a staff in a given date and time interval (Each row has a unique ID). There are cases where a staff might provide services in overlapping time frames. I need to write a query that merges overlapping time intervals and returns the data in the format shown below.

I tried grouping by StaffID and Date fields and getting the Min of BeginTime and Max of EndTime but that does not account for the non-overlapping time frames. How can I accomplish this? Again, the table contains several million records so a recursive CTE approach might have performance issues. Thanks in advance.

Service Table

ID    StaffID  Date        BeginTime EndTime
1     101      2014-01-01  08:00     09:00
2     101      2014-01-01  08:30     09:30
3     101      2014-01-01  18:00     20:30
4     101      2014-01-01  19:00     21:00

Output

StaffID Date        BeginTime EndTime
101     2014-01-01  08:00     09:30
101     2014-01-01  18:00     21:00

Here is another sample data set with a query proposed by a contributor. http://sqlfiddle.com/#!6/bfbdc/3

The first two rows in the results set should be merged into one row (06:00-08:45) but it generates two rows (06:00-08:30 & 06:00-08:45)

回答1:

I only came up with a CTE query as the problem is there may be a chain of overlapping times, e.g. record 1 overlaps with record 2, record 2 with record 3 and so on. This is hard to resolve without CTE or some other kind of loops, etc. Please give it a go anyway.

The first part of the CTE query gets the services that start a new group and are do not have the same starting time as some other service (I need to have just one record that starts a group). The second part gets those that start a group but there's more then one with the same start time - again, I need just one of them. The last part recursively builds up on the starting group, taking all overlapping services.

Here is SQLFiddle with more records added to demonstrate different kinds of overlapping and duplicate times.

I couldn't use ServiceID as it would have to be ordered in the same way as BeginTime.

;with flat as
(
 select StaffID, ServiceDate, BeginTime, EndTime, BeginTime as groupid 
 from services S1
 where not exists (select * from services S2 
 where S1.StaffID = S2.StaffID 
 and S1.ServiceDate = S2.ServiceDate 
 and S2.BeginTime <= S1.BeginTime and S2.EndTime <> S1.EndTime
 and S2.EndTime > S1.BeginTime)

  union all

  select StaffID, ServiceDate, BeginTime, EndTime, BeginTime as groupid 
  from services S1
 where exists (select * from services S2 
 where S1.StaffID = S2.StaffID 
 and S1.ServiceDate = S2.ServiceDate 
 and S2.BeginTime = S1.BeginTime and S2.EndTime > S1.EndTime)
   and not exists (select * from services S2 
 where S1.StaffID = S2.StaffID 
 and S1.ServiceDate = S2.ServiceDate 
 and S2.BeginTime < S1.BeginTime
 and S2.EndTime > S1.BeginTime)

 union all

 select S.StaffID, S.ServiceDate, S.BeginTime, S.EndTime, flat.groupid 
 from flat
 inner join services S 
 on flat.StaffID = S.StaffID
 and flat.ServiceDate = S.ServiceDate
 and flat.EndTime > S.BeginTime
 and flat.BeginTime < S.BeginTime and flat.EndTime < S.EndTime
)

select StaffID, ServiceDate, MIN(BeginTime) as begintime, MAX(EndTime) as endtime 
from flat
group by StaffID, ServiceDate, groupid
order by StaffID, ServiceDate, begintime, endtime


回答2:

Elsewhere I've answered a similar Date Packing question with a geometric strategy. Namely, I interperet the date ranges as a line, and utilize geometry::UnionAggregate to merge the ranges.

Your question has two peculiarities though. First, it calls for sql-server-2008. geometry::UnionAggregate is not then avialable. However, download the microsoft library at https://github.com/microsoft/SQLServerSpatialTools and load it in as a clr assembly to your instance and you have it available as dbo.GeometryUnionAggregate.

But the real peculiarity that has my interest is the concern that you have several million rows to work with. So I thought I'd repeat the strategy here but with an added technique to improve it's performance. This technique will work well if you have a lot of your StaffID/date subsets that are the same.


First, let's build a numbers table. Swap this out with your favorite way to do it.

select  i = row_number() over (order by (select null)) 
into    #numbers
from    @services; -- where i put your data

Then convert the dates to floats and use those floats to create geometrical points.

These points can then be turned into lines via STUnion and STEnvelope.

With your ranges now represented as geometric lines, merge them via UnionAggregate. The resulting geometry object 'lines' might contain multiple lines. But any overlapping lines turn into one line.

select      s.StaffID, 
            s.Date,
            linesWKT = geometry::UnionAggregate(line).ToString() 

            -- If you have SQLSpatialTools installed then:
            -- linesWKT = dbo.GeometryUnionAggregate(line).ToString() 

into        #aggregateRangesToGeo
from        @services s
cross apply (select 
                beginTimeF = convert(float, convert(datetime,beginTime)),
                endTimeF = convert(float, convert(datetime,endTime))
            ) prepare
cross apply (select
                beginPt = geometry::Point(beginTimeF, 0, 0),
                endPt = geometry::Point(endTimeF, 0, 0)
            ) pointify
cross apply (select 
                line = beginPt.STUnion(endPt).STEnvelope()
            ) lineify
group by    s.StaffID,
            s.Date;

You have one 'lines' object for each staffId/date combo. But depending on your dataset, there may be many 'lines' objects that are the same between these combos. This may very well be true if staff are expected to follow a routine and data is recorded to the nearest whatever.

So get a distinct lising of 'lines' objects. This should improve performance.

From this, extract the individual lines inside 'lines'. Envelope the lines, which ensures that the lines are stored only as their endpoints. Read the endpoint x values and convert them back to their time representations.

Keep the WKT representation to join it back to the combos later on.

select      lns.linesWKT,
            beginTime = convert(time, convert(datetime, ap.beginTime)),
            endTime = convert(time, convert(datetime, ap.endTime))
into        #parsedLines
from        (select distinct linesWKT from #aggregateRangesToGeo) lns
cross apply (select 
                lines = geometry::STGeomFromText(linesWKT, 0)
            ) geo
join        #numbers n on n.i between 1 and geo.lines.STNumGeometries()
cross apply (select 
                line = geo.lines.STGeometryN(n.i).STEnvelope()
            ) ln
cross apply (select 
                beginTime = ln.line.STPointN(1).STX,
                endTime = ln.line.STPointN(3).STX
            ) ap;

Now just join your parsed data back to the StaffId/Date combos.

select      ar.StaffID,
            ar.Date,
            pl.beginTime, 
            pl.endTime
from        #aggregateRangesToGeo ar
join        #parsedLines pl on ar.linesWKT = pl.linesWKT
order by    ar.StaffID, 
            ar.Date,
            pl.beginTime;