I have (simplified for the example) a table with the following data
Row Start Finish ID Amount
--- --------- ---------- -- ------
1 2008-10-01 2008-10-02 01 10
2 2008-10-02 2008-10-03 02 20
3 2008-10-03 2008-10-04 01 38
4 2008-10-04 2008-10-05 01 23
5 2008-10-05 2008-10-06 03 14
6 2008-10-06 2008-10-07 02 3
7 2008-10-07 2008-10-08 02 8
8 2008-10-08 2008-11-08 03 19
The dates represent a period in time, the ID is the state a system was in during that period and the amount is a value related to that state.
What I want to do is to aggregate the Amounts for adjacent rows with the same ID number, but keep the same overall sequence so that contiguous runs can be combined. Thus I want to end up with data like:
Row Start Finish ID Amount
--- --------- ---------- -- ------
1 2008-10-01 2008-10-02 01 10
2 2008-10-02 2008-10-03 02 20
3 2008-10-03 2008-10-05 01 61
4 2008-10-05 2008-10-06 03 14
5 2008-10-06 2008-10-08 02 11
6 2008-10-08 2008-11-08 03 19
I am after a T-SQL solution that can be put into a SP, however I can't see how to do that with simple queries. I suspect that it may require iteration of some sort but I don't want to go down that path.
The reason I want to do this aggregation is that the next step in the process is to do a SUM() and Count() grouped by the unique ID's that occur within the sequence, so that my final data will look something like:
ID Counts Total
-- ------ -----
01 2 71
02 2 31
03 2 33
However if I do a simple
SELECT COUNT(ID), SUM(Amount) FROM data GROUP BY ID
On the original table I get something like
ID Counts Total
-- ------ -----
01 3 71
02 3 31
03 2 33
Which is not what I want.
If you read the book "Developing Time-Oriented Database Applications in SQL" by R T Snodgrass (the pdf of which is available from his web site under publications), and get as far as Figure 6.25 on p165-166, you will find the non-trivial SQL which can be used in the current example to group the various rows with the same ID value and continuous time intervals.
The query development below is close to correct, but there is a problem spotted right at the end, that has its source in the first SELECT statement. I've not yet tracked down why the incorrect answer is being given. [If someone can test the SQL on their DBMS and tell me whether the first query works correctly there, it would be a great help!]
It looks something like:
-- Derived from Figure 6.25 from Snodgrass "Developing Time-Oriented
-- Database Applications in SQL"
CREATE TABLE Data
(
Start DATE,
Finish DATE,
ID CHAR(2),
Amount INT
);
INSERT INTO Data VALUES('2008-10-01', '2008-10-02', '01', 10);
INSERT INTO Data VALUES('2008-10-02', '2008-10-03', '02', 20);
INSERT INTO Data VALUES('2008-10-03', '2008-10-04', '01', 38);
INSERT INTO Data VALUES('2008-10-04', '2008-10-05', '01', 23);
INSERT INTO Data VALUES('2008-10-05', '2008-10-06', '03', 14);
INSERT INTO Data VALUES('2008-10-06', '2008-10-07', '02', 3);
INSERT INTO Data VALUES('2008-10-07', '2008-10-08', '02', 8);
INSERT INTO Data VALUES('2008-10-08', '2008-11-08', '03', 19);
SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS (SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS (SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS (SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish)
OR (T2.Start <= L.Finish AND L.Finish < T2.Finish)));
The output from that query is:
01 2008-10-01 2008-10-02
01 2008-10-03 2008-10-05
02 2008-10-02 2008-10-03
02 2008-10-06 2008-10-08
03 2008-10-05 2008-10-06
03 2008-10-05 2008-11-08
03 2008-10-08 2008-11-08
Edited: There's a problem with the penultimate row - it should not be there. And I'm not clear (yet) where it is coming from.
Now we need to treat that complex expression as a query expression in the FROM clause of another SELECT statement, which will sum the amount values for a given ID over the entries that overlap with the maximal ranges shown above.
SELECT M.ID, M.Start, M.Finish, SUM(D.Amount)
FROM Data AS D,
(SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS (SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS (SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS (SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish)
OR (T2.Start <= L.Finish AND L.Finish < T2.Finish)))) AS M
WHERE D.ID = M.ID
AND M.Start <= D.Start
AND M.Finish >= D.Finish
GROUP BY M.ID, M.Start, M.Finish
ORDER BY M.ID, M.Start;
This gives:
ID Start Finish Amount
01 2008-10-01 2008-10-02 10
01 2008-10-03 2008-10-05 61
02 2008-10-02 2008-10-03 20
02 2008-10-06 2008-10-08 11
03 2008-10-05 2008-10-06 14
03 2008-10-05 2008-11-08 33 -- Here be trouble!
03 2008-10-08 2008-11-08 19
Edited: This is almost the correct data set on which to do the COUNT and SUM aggregation requested by the original question, so the final answer is:
SELECT I.ID, COUNT(*) AS Number, SUM(I.Amount) AS Amount
FROM (SELECT M.ID, M.Start, M.Finish, SUM(D.Amount) AS Amount
FROM Data AS D,
(SELECT DISTINCT F.ID, F.Start, L.Finish
FROM Data AS F, Data AS L
WHERE F.Start < L.Finish
AND F.ID = L.ID
-- There are no gaps between F.Finish and L.Start
AND NOT EXISTS
(SELECT *
FROM Data AS M
WHERE M.ID = F.ID
AND F.Finish < M.Start
AND M.Start < L.Start
AND NOT EXISTS
(SELECT *
FROM Data AS T1
WHERE T1.ID = F.ID
AND T1.Start < M.Start
AND M.Start <= T1.Finish))
-- Cannot be extended further
AND NOT EXISTS
(SELECT *
FROM Data AS T2
WHERE T2.ID = F.ID
AND ((T2.Start < F.Start AND F.Start <= T2.Finish) OR
(T2.Start <= L.Finish AND L.Finish < T2.Finish)))
) AS M
WHERE D.ID = M.ID
AND M.Start <= D.Start
AND M.Finish >= D.Finish
GROUP BY M.ID, M.Start, M.Finish
) AS I
GROUP BY I.ID
ORDER BY I.ID;
id number amount
01 2 71
02 2 31
03 3 66
Review:
Oh! Drat...the entry for 3 has twice the 'amount' that it should have. Previous 'edited' parts indicate where things started to go wrong. It looks as though either the first query is subtly wrong (maybe it is intended for a different question), or the optimizer I'm working with is misbehaving. Nevertheless, there should be an answer closely related to this that will give the correct values.
For the record: tested on IBM Informix Dynamic Server 11.50 on Solaris 10. However, should work fine on any other moderately standard-conformant SQL DBMS.
Probably need to create a cursor and loop through the results, keeping track of which id you are working with and accumulating the data along the way. When the id changes you can insert the accumulated data into a temporary table and return the table at the end of the procedure (select all from it). A table-based function might be better as you can then just insert into the return table as you go along.
I suspect that it may require iteration of some sort but I don't want to go down that path.
I think that's the route you'll have to take, use a cursor to populate a table variable. If you have a large number of records you could use a permanent table to store the results then when you need to retrieve the data you could process only the new data.
I would add a bit field with a default of 0 to the source table to keep track of which records have been processed. Assuming no one is using select * on the table, adding a column with a default value won't affect the rest of your application.
Add a comment to this post if you want help coding the solution.
Well I decided to go down the iteration route using a mixture of joins and cursors. By JOINing the data table against itself I can create a link list of only those records that are consecutive.
INSERT INTO #CONSEC
SELECT a.ID, a.Start, b.Finish, b.Amount
FROM Data a JOIN Data b
ON (a.Finish = b.Start) AND (a.ID = b.ID)
Then I can unwind the list by iterating over it with a cursor, and doing updates back to the data table to adjust (And delete the now extraneous records from the Data table)
DECLARE CCursor CURSOR FOR
SELECT ID, Start, Finish, Amount FROM #CONSEC ORDER BY Start DESC
@Total = 0
OPEN CCursor
FETCH NEXT FROM CCursor INTO @ID, @START, @FINISH, @AMOUNT
WHILE @FETCH_STATUS = 0
BEGIN
@Total = @Total + @Amount
@Start_Last = @Start
@Finish_Last = @Finish
@ID_Last = @ID
DELETE FROM Data WHERE Start = @Finish
FETCH NEXT FROM CCursor INTO @ID, @START, @FINISH, @AMOUNT
IF (@ID_Last<> @ID) OR (@Finish<>@Start_Last)
BEGIN
UPDATE Data
SET Amount = Amount + @Total
WHERE Start = @Start_Last
@Total = 0
END
END
CLOSE CCursor
DEALLOCATE CCursor
This all works and has acceptable performance for typical data that I am using.
I did find one small issue with the above code. Originally I was updating the Data table on each loop through the cursor. But this didn't work. It seems that you can only do one update on a record, and that multiple updates (in order to keep adding data) revert back to the reading the original contents of the record.