SQL Server separate overlapping dates

2020-03-08 06:17发布

问题:

I have a table with the following data:

If the dates overlap for a Type, I want to return a separate row for the overlapping period so end up with the following results:

回答1:

Consider treating the ranges as geometric lines and then using SQL Server's geometry tools.

First, I need to get your data into a table I can work with:

declare @spans table (
    type char(1),
    start datetime,
    stop datetime
);

insert @spans values 
    ('S', '2010-01-01', '2010-01-14'),
    ('S', '2010-01-10', '2010-01-31'),
    ('A', '2010-01-05', '2010-01-30'),
    ('A', '2010-01-24', '2010-02-06'),
    ('T', '2010-01-20', '2010-01-27'),
    ('T', '2010-01-28', '2010-01-30');

Unfortunately, SQL Server's spatial tools are limited on one respect. When you need to query individual shapes or points inside a 'geometry' value, you have to pass the index number to get it. Until microsoft comes out with a TVF that outputs all the shapes, we resort to a numbers table to get the job done.

Swap this out with your preferred method to create a numbers table:

declare     @numbers table (i int);
insert      @numbers    
select      i = row_number() over (order by (select null))
from        @spans a, @spans b;

Now you're ready for the main query.

In 'geoAggregates':

  • I convert the dates to float values. I add one to the stop value to make the geometry work in the real-number space, as opposed to 'integer space' where there are no values between units.
  • I convert to float values to geometric points
  • I make lines out of these points, and I also hold the points together into a single geometric space called 'splitters'
  • I aggregate the lines and I aggregate the splitters. For the lines this merges them into one range if they overlap. For the splitters, it just further condenses them into one collection.

In the outer query:

  • I split the lines with the splitters. The splitters need a buffer to give them non-zero length in order to actually perform the split.
  • I extract the collections of splitted lines into individual lines
  • I envelope the lines to ensure they are only represented by their endpoints.
  • I parse the endpoints to get the float values, undo the added value for stop, and round these to undo the buffer effects (and the float storage effects).
  • I convert these float values back into date representations.

Here' the code. It outputs as you expect, except for your type-o in the second row of 'A'.

with

    geoAggregates as (

        select      type,
                    lines = geometry::UnionAggregate(line),
                    splitters = geometry::UnionAggregate(splitters)
        from        @spans
        cross apply (select 
                        startF = convert(float, start),
                        stopF = convert(float, stop) + 1
                    ) prepare
        cross apply (select 
                        startP = geometry::Point(startF, 0, 0),
                        stopP = geometry::Point(stopF, 0, 0)
                    ) pointify
        cross apply (select 
                        line = startP.STUnion(stopP).STEnvelope(),
                        splitters = startP.STUnion(stopP)
                    ) lineify
        group by    type 

    )

    select      type, 
                start,
                stop 
    from        geoAggregates ga
    cross apply (select 
                    splitted = ga.lines.STDifference(splitters.STBuffer(0.001))
                ) sp
    join        @numbers n on n.i between 1 and sp.splitted.STNumGeometries()
    cross apply (select 
                    line = sp.splitted.STGeometryN(i).STEnvelope()
                ) l
    cross apply (select 
                    start = convert(datetime, round(l.line.STPointN(1).STX,0)),
                    stop = convert(datetime, round(l.line.STPointN(3).STX - 1,0)) 
                ) dateify
    order by    type, start;


回答2:

I'm ignoring PKey for now since I'm not sure if it's actually relevant to the question.

This solves the problem:

declare @t table (PKey int,Start date,[End] date,Type char(1))
insert into @t(PKey,Start,[End],Type) values
(1,'20100101','20100114','S'),
(2,'20100110','20100131','S'),
(3,'20100105','20100130','A'),
(4,'20100124','20100206','A'),
(5,'20100120','20100127','T'),
(6,'20100128','20100130','T')

;With EndDates as (
    select [End],Type from @t
    union all
    select DATEADD(day,-1,Start),Type from @t
), Periods as (
    select Type,MIN(Start) as Start,
                (select MIN([End]) from EndDates e
                 where e.Type = t.Type and
                 e.[End] >= MIN(Start)) as [End]
    from
        @t t
    group by Type
    union all
    select p.Type,DATEADD(day,1,p.[End]),e.[End]
    from
        Periods p
            inner join
        EndDates e
            on
                p.Type = e.Type and
                p.[End] < e.[End]
    where
        not exists (select * from EndDates e2 where
                e2.Type = p.Type and
                e2.[End] > p.[End] and
                e2.[End] < e.[End])
)
select * from Periods
order by Type,Start

First, we create a CTE called EndDates that contains all dates which might be the end of a period - these are either the end dates that we already have in our data, or they're the day before one of the start dates in our data.

We then build up the periods - first we find the first period for any particular type - which is we take the earliest start date, and the earliest possible end date that comes after the start date.

Then, recursively, we build up additional periods by starting new periods the day after existing ones have ended, and finding the earliest end date that follows that date.

And then, basically, we're done. Results:

Type Start      End
---- ---------- ----------
A    2010-01-05 2010-01-23
A    2010-01-24 2010-01-30
A    2010-01-31 2010-02-06
S    2010-01-01 2010-01-09
S    2010-01-10 2010-01-14
S    2010-01-15 2010-01-31
T    2010-01-20 2010-01-27
T    2010-01-28 2010-01-30

Which doesn't exactly match what's in your question but I assume the A row ending on 30th February was a typo.

(I'd recommend renaming your End column though, because using reserved words for column names can turn into a real pain)