islands and gaps tsql

2019-06-05 15:44发布

问题:

I have been struggling with a problem that should be pretty simple actually but after a full week of reading, googling, experimenting and so on, my colleague and we cannot find the proper solution. :(

The problem: We have a table with two values: an employeenumber (P_ID, int) <--- identification of employee a date (starttime, datetime) <--- time employee checked in

  • We need to know what periods each employee has been working.
  • When two dates are less then @gap days apart, they belong to the same period
  • For each employee there can be multiple records for any given day but I just need to know which dates he worked, I am not interested in the time part
  • As soon as there is a gap > @gap days, the next date is considered the start of a new range
  • A range is at least 1 day (example: 21-9-2011 | 21-09-2011) but has no maximum length. (An employee checking in every @gap - 1 days should result in a period from the first day he checked in until today)

What we think we need are the islands in this table where the gap in days is greater than @variable (@gap = 30 means 30 days)

So an example:

SOURCETABLE:

P_ID  | starttime
------|------------------
12121 | 24-03-2009 7:30
12121 | 24-03-2009 14:25 
12345 | 27-06-2011 10:00
99999 | 01-05-2012 4:50 
12345 | 27-06-2011 10:30
12345 | 28-06-2011 11:00
98765 | 13-04-2012 10:00
12345 | 21-07-2011 9:00
99999 | 03-05-2012 23:15
12345 | 21-09-2011 12:00
45454 | 12-07-2010 8:00
12345 | 21-09-2011 17:00
99999 | 06-05-2012 11:05
99999 | 20-05-2012 12:45
98765 | 26-04-2012 16:00
12345 | 07-07-2012 14:00
99999 | 01-06-2012 13:55
12345 | 13-08-2012 13:00

Now what I need as a result is:

PERIODS:

P_ID  |   Start    |    End
-------------------------------
12121 | 24-03-2009 | 24-03-2009
12345 | 27-06-2012 | 21-07-2012
12345 | 21-09-2012 | 21-09-2012
12345 | 07-07-2012 | (today) OR 13-08-2012  <-- (less than @gap days ago) OR (last date in table)
45454 | 12-07-2010 | 12-07-2010
45454 | 17-06-2012 | 17-06-2012 
98765 | 13-04-2012 | 26-04-2012
99999 | 01-05-2012 | 01-06-2012

I hope this is clear this way, I already thank you for reading this far, it would be great if you could contribute!

回答1:

I've done a rough script that should get you started. Haven't bothered refining the datetimes and the endpoint comparisons might need tweaking.

select 
    P_ID,
    src.starttime,
    endtime = case when src.starttime <> lst.starttime or lst.starttime < DATEADD(dd,-1 * @gap,GETDATE()) then lst.starttime else GETDATE() end,
    frst.starttime,
    lst.starttime
from @SOURCETABLE src
outer apply (select starttime = MIN(starttime) from @SOURCETABLE sub where src.p_id = sub.p_id and sub.starttime > DATEADD(dd,-1 * @gap,src.starttime)) frst
outer apply (select starttime = MAX(starttime) from @SOURCETABLE sub where src.p_id = sub.p_id and src.starttime > DATEADD(dd,-1 * @gap,sub.starttime)) lst
where src.starttime = frst.starttime
order by P_ID, src.starttime

I get the following output, which is a litle different to yours, but I think its ok:

P_ID        starttime               endtime                 starttime               starttime
----------- ----------------------- ----------------------- ----------------------- -----------------------
12121       2009-03-24 07:30:00.000 2009-03-24 14:25:00.000 2009-03-24 07:30:00.000 2009-03-24 14:25:00.000
12345       2011-06-27 10:00:00.000 2011-07-21 09:00:00.000 2011-06-27 10:00:00.000 2011-07-21 09:00:00.000
12345       2011-09-21 12:00:00.000 2011-09-21 17:00:00.000 2011-09-21 12:00:00.000 2011-09-21 17:00:00.000
12345       2012-07-07 14:00:00.000 2012-07-07 14:00:00.000 2012-07-07 14:00:00.000 2012-07-07 14:00:00.000
12345       2012-08-13 13:00:00.000 2012-08-16 11:23:25.787 2012-08-13 13:00:00.000 2012-08-13 13:00:00.000
45454       2010-07-12 08:00:00.000 2010-07-12 08:00:00.000 2010-07-12 08:00:00.000 2010-07-12 08:00:00.000
98765       2012-04-13 10:00:00.000 2012-04-26 16:00:00.000 2012-04-13 10:00:00.000 2012-04-26 16:00:00.000

The last two output cols are the results of the outer apply sections, and are just there for debugging.

This is based on the following setup:

declare @gap int
set @gap = 30

set dateformat dmy
-----P_ID----|----starttime----
declare @SOURCETABLE table (P_ID int, starttime datetime)
insert @SourceTable values 
(12121,'24-03-2009 7:30'),
(12121,'24-03-2009 14:25'),
(12345,'27-06-2011 10:00'),
(12345,'27-06-2011 10:30'),
(12345,'28-06-2011 11:00'),
(98765,'13-04-2012 10:00'),
(12345,'21-07-2011 9:00'),
(12345,'21-09-2011 12:00'),
(45454,'12-07-2010 8:00'),
(12345,'21-09-2011 17:00'),
(98765,'26-04-2012 16:00'),
(12345,'07-07-2012 14:00'),
(12345,'13-08-2012 13:00')

UPDATE: Slight rethink. Now uses a CTE to work out the gaps forwards and backwards from each item, then aggregates those:

--Get the gap between each starttime and the next and prev (use 999 to indicate non-closed intervals)
;WITH CTE_Gaps As ( 
    select
        p_id,
        src.starttime,
        nextgap = coalesce(DATEDIFF(dd,src.starttime,nxt.starttime),999), --Gap to the next entry
        prevgap = coalesce(DATEDIFF(dd,prv.starttime,src.starttime),999), --Gap to the previous entry
        isold = case when DATEDIFF(dd,src.starttime,getdate()) > @gap then 1 else 0 end --Is starttime more than gap days ago?
    from
        @SOURCETABLE src
        cross apply (select starttime = MIN(starttime) from @SOURCETABLE sub where src.p_id = sub.p_id and sub.starttime > src.starttime) nxt
        cross apply (select starttime = max(starttime) from @SOURCETABLE sub where src.p_id = sub.p_id and sub.starttime < src.starttime) prv   
)
--select * from CTE_Gaps
select
        p_id,
        starttime = min(gap.starttime),
        endtime = nxt.starttime
    from
        CTE_Gaps gap
        --Find the next starttime where its gap to the next > @gap
        cross apply (select starttime = MIN(sub.starttime) from CTE_Gaps sub where gap.p_id = sub.p_id and sub.starttime >= gap.starttime and sub.nextgap > @gap) nxt
group by P_ID, nxt.starttime
order by P_ID, nxt.starttime


回答2:

Jon most definitively has shown us the right direction. Performance was horrible though (4million+ records in the database). And it looked like we were missing some information. With all that we learned from you we came up with the solution below. It uses elements of all the proposed answers and cycles through 3 temptables before finally spewing results but performance is good enough, as well as the data it generates.

declare @gap int
declare @Employee_id int

set @gap = 30   
set dateformat dmy
--------------------------------------------------------------- #temp1 --------------------------------------------------
CREATE TABLE #temp1 ( EmployeeID int, starttime date)
INSERT INTO #temp1 ( EmployeeID, starttime)

select distinct ck.Employee_id, 
                cast(ck.starttime as date)
from SERVER1.DB1.dbo.checkins pd
        inner join SERVER1.DB1.dbo.Team t on ck.team_id = t.id
where t.productive = 1

--------------------------------------------------------------- #temp2 --------------------------------------------------

create table #temp2 (ROWNR int, Employeeid int, ENDOFCHECKIN datetime, FIRSTCHECKIN datetime)
INSERT INTO #temp2 

select Row_number() OVER (partition by EmployeeID ORDER BY t.prev) + 1 as ROWNR,
             EmployeeID,
             DATEADD(DAY, 1, t.Prev) AS start_gap,
           DATEADD(DAY, 0, t.next) AS end_gap
from 
             (
                    select a.EmployeeID,
                                  a.starttime as Prev, 
                                  (
                                  select min(b.starttime)
                                  from #temp1 as b
                                  where starttime > a.starttime and b.EmployeeID = a.EmployeeID 
                                  ) as Next
from #temp1 as a) as t

where  datediff(day, prev, next ) > 30
group by     EmployeeID,
                    t.Prev,
                    t.next
union -- add first known date for Employee 

select      1 as ROWNR,
            EmployeeID,
            NULL,
            min(starttime)
from #temp1 ct
group by ct.EmployeeID

--------------------------------------------------------------- #temp3 --------------------------------------------------

create table #temp3 (ROWNR int, Employeeid int, ENDOFCHECKIN datetime, STARTOFCHECKIN datetime)
INSERT INTO #temp3

select  ROWNR,
        Employeeid,
        ENDOFCHECKIN,
        FIRSTCHECKIN
from #temp2 

union -- add last known date for Employee 

select       (select count(*) from #temp2 b where Employeeid = ct.Employeeid)+1 as ROWNR,
             ct.Employeeid,
            (select dateadd(d,1,max(starttime)) from #temp1 c where Employeeid = ct.Employeeid),
             NULL
from #temp2 ct
group by ct.EmployeeID

---------------------------------------finally check our data-------------------------------------------------


select              a1.Employeeid,
                    a1.STARTOFCHECKIN as STARTOFCHECKIN,
                    ENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN a1.ENDOFCHECKIN ELSE b1.ENDOFCHECKIN END,
                    year(a1.STARTOFCHECKIN) as JaarSTARTOFCHECKIN,
                    JaarENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN  year(a1.ENDOFCHECKIN) ELSE  year(b1.ENDOFCHECKIN) END,
                    Month(a1.STARTOFCHECKIN) as MaandSTARTOFCHECKIN,
                    MaandENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN  month(a1.ENDOFCHECKIN) ELSE  month(b1.ENDOFCHECKIN) END,
                    (year(a1.STARTOFCHECKIN)*100)+month(a1.STARTOFCHECKIN) as JaarMaandSTARTOFCHECKIN,
                    JaarMaandENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN (year(a1.ENDOFCHECKIN)*100)+month(a1.STARTOFCHECKIN) ELSE (year(b1.ENDOFCHECKIN)*100)+month(b1.ENDOFCHECKIN) END,
                    datediff(M,a1.STARTOFCHECKIN,b1.ENDOFCHECKIN) as MONTHSCHECKEDIN
from #temp3 a1
       full outer join #temp3 b1 on a1.ROWNR = b1.ROWNR -1 and a1.Employeeid = b1.Employeeid
where not (a1.STARTOFCHECKIN is null AND b1.ENDOFCHECKIN is null) 
order by a1.Employeeid, a1.STARTOFCHECKIN