I have a table which contains an ID
and a Date
for an event. Each row is for one date. I am trying to determine consecutive date ranges and consolidate output to show the ID,StartDate,EndDate
ID Date
200236 2011-01-02 00:00:00.000
200236 2011-01-03 00:00:00.000
200236 2011-01-05 00:00:00.000
200236 2011-01-06 00:00:00.000
200236 2011-01-07 00:00:00.000
200236 2011-01-08 00:00:00.000
200236 2011-01-09 00:00:00.000
200236 2011-01-10 00:00:00.000
200236 2011-01-11 00:00:00.000
200236 2011-01-12 00:00:00.000
200236 2011-01-13 00:00:00.000
200236 2011-01-15 00:00:00.000
200236 2011-01-16 00:00:00.000
200236 2011-01-17 00:00:00.000
Output would look like:
ID StartDate EndDate
200236 2011-01-02 2011-01-03
200236 2011-01-05 2011-01-13
200236 2011-01-15 2011-01-17
Any thoughts on how to handle this in SQL Server 2000?
SELECT ...
FROM ...
WHERE date_column BETWEEN '2011-01-02' AND '2011-01-15'
perhaps? Reference
Or you can do a sub-query and link the next record using a MAX where date is <= current date:
SELECT id, date, (SELECT MAX(date) FROM mytable WHERE date <= mytable.date) AS nextDate
FROM mytable
Or use:
SELECT TOP 1 date
FROM mytable
WHERE date <= mytable.date AND id <> mytable.id
ORDER BY date
As the sub-query so it grabs the next date in line after the current record.
I've just done this similar thing in SQL Server 2008. I think the following translation will work for SQL Server 2000:
-- Create table variable
DECLARE @StartTable TABLE
(
rowid INT IDENTITY(1,1) NOT NULL,
userid int,
startDate date
)
Insert Into @StartTable(userid, startDate)
--This finds the start dates by finding unmatched values
SELECT t1.ID, t1.[Date]
FROM Example As t1
LEFT OUTER JOIN Example As t2 ON t1.ID=t2.ID
And DateAdd(day, 1, t2.[Date]) = t1.[Date]
WHERE t2.[Date] Is NULL
ORDER BY t1.ID, t1.[Date]
-- Create table variable
DECLARE @EndTable TABLE
(
rowid INT IDENTITY(1,1) NOT NULL,
userid int,
endDate date
)
Insert Into @EndTable(userid, endDate)
--This finds the end dates by getting unmatched values
SELECT t1.ID, t1.[Date]
FROM Example As t1
LEFT OUTER JOIN Example As t2 ON t1.ID=t2.ID
And DateAdd(day, -1, t2.[Date]) = t1.[Date]
WHERE t2.[Date] IS NULL
ORDER BY t1.ID, t1.[Date]
Select eT.userid, startDate, endDate
From @EndTable eT
INNER JOIN @StartTable sT On eT.userid = sT.userid
AND eT.rowid = sT.rowid;
So as you can see, I created two table variables, one for starts and one for ends, by self-joining the table on the date either just prior to or just after the date in the [Date] column. This means that I'm selecting only records that don't have a date prior (so these would be at the beginning of a period) for the Start Table and those that have no date following (so these would be at the end of a period) for the End Table.
When these are inserted into the table variable, they are numbered in sequence because of the Identity column. Then I join the two table variables together. Because they are ordered, the start and end dates should always match up properly.
This solution works for me because I have at most one record per ID per day and I am only interested in days, not hours, etc. Even though it is several steps, I like it because it is conceptually simple and eliminates matched records without having cursors or loops. I hope it will work for you too.
This SO Question might help you. I linked directly to Rob Farley's answer as I feel this is a similar problem.
One approach you can take is to add a field that indicates the next date in the sequence. (Either add it to your current table or use a temporary table, store the underlying data to the temp table and then update the next date in the sequence).
Your starting data structure would look something like this:
ID, PerfDate, NextDate
200236, 2011-01-02, 2011-01-03
200236, 2011-01-03, 2011-01-04
etc.
You can then use a series of correlated subqueries to roll the data up into the desired output:
SELECT ID, StartDate, EndDate
FROM (
SELECT DISTINCT ID, PerfDate AS StartDate,
(SELECT MIN([PerfDate]) FROM [SourceTable] S3
WHERE S3.ID = S1.ID
AND S3.NextDate > S1.PerfDate
AND ISNULL(
(SELECT MIN(PerfDate)
FROM [SourceTable] AS S4
WHERE S4.ID = S1.ID
AND S4.NextDate > S3.NextDate), S3.NextDate + 1) > S3.NextDate) AS EndDate
FROM [SourceTable] S1
WHERE
ISNULL(
(SELECT MAX(NextDate)
FROM [SourceTable] S2
WHERE S2.ID = S1.ID
AND S2.PerfDate < S1.PerfDate), PerfDate -1) < S1.PerfDate)q
ORDER BY q.ID, q.StartDate
This is the way I've done it in the past. It's a two step process:
- Build the set of candidate contiguous periods
- If there are any overlapping periods, delete all but the longest such period.
Here's a script that shows how it's done. You might be able to pull it off in a single [bug, ugly] query, but trying to do that makes my head hurt. I'm using temp tables as it makes the debugging a whole lot easier.
drop table #source
create table #source
(
id int not null ,
dtCol datetime not null ,
-----------------------------------------------------------------------
-- ASSUMPTION 1: Each date must be unique for a given ID value.
-----------------------------------------------------------------------
unique clustered ( id , dtCol ) ,
-----------------------------------------------------------------------
-- ASSUMPTION 2: The datetime column only represents a day.
-- The value of the time component is always 00:00:00.000
-----------------------------------------------------------------------
check ( dtCol = convert(datetime,convert(varchar,dtCol,112),112) ) ,
)
go
insert #source values(1,'jan 1, 2011')
insert #source values(1,'jan 4, 2011')
insert #source values(1,'jan 5, 2011')
insert #source values(2,'jan 1, 2011')
insert #source values(2,'jan 2, 2011')
insert #source values(2,'jan 3, 2011')
insert #source values(2,'jan 5, 2011')
insert #source values(3,'jan 1, 2011')
insert #source values(4,'jan 1, 2011')
insert #source values(4,'jan 2, 2011')
insert #source values(4,'jan 3, 2011')
insert #source values(4,'jan 4, 2011')
go
insert #source values( 200236 , '2011-01-02')
insert #source values( 200236 , '2011-01-03')
insert #source values( 200236 , '2011-01-05')
insert #source values( 200236 , '2011-01-06')
insert #source values( 200236 , '2011-01-07')
insert #source values( 200236 , '2011-01-08')
insert #source values( 200236 , '2011-01-09')
insert #source values( 200236 , '2011-01-10')
insert #source values( 200236 , '2011-01-11')
insert #source values( 200236 , '2011-01-12')
insert #source values( 200236 , '2011-01-13')
insert #source values( 200236 , '2011-01-15')
insert #source values( 200236 , '2011-01-16')
insert #source values( 200236 , '2011-01-17')
go
drop table #candidate_range
go
create table #candidate_range
(
rowId int not null identity(1,1) ,
id int not null ,
dtFrom datetime not null ,
dtThru datetime not null ,
length as 1+datediff(day,dtFrom,dtThru) ,
primary key nonclustered ( rowID ) ,
unique clustered (id,dtFrom,dtThru) ,
)
go
--
-- seed the candidate range table with the set of all possible contiguous ranges for each id
--
insert #candidate_range ( id , dtFrom , dtThru )
select id = tFrom.id ,
valFrom = tFrom.dtCol ,
valThru = tThru.dtCol
from #source tFrom
join #source tThru on tThru.id = tFrom.id
and tThru.dtCol >= tFrom.dtCol
where 1+datediff(day,tFrom.dtCol,tThru.dtCol) = ( select count(*)
from #source t
where t.id = tFrom.id
and t.dtCol between tFrom.dtCol and tThru.dtCol
)
order by 1,2,3
go
--
-- compare the table to itself. If we find overlapping periods,
-- we'll keep the longest such period and delete the shorter overlapping periods.
--
delete t2
from #candidate_range t1
join #candidate_range t2 on t2.id = t1.id
and t2.rowId != t1.rowID
and t2.length < t1.length
and t2.dtFrom <= t1.dtThru
and t2.dtThru >= t1.dtFrom
go
That's about all there is to it.