I have the following table structure:
Item
ID | Name
--------
1 | Apple
2 | Pear
3 | Banana
4 | Plum
5 | Tomato
Event
ItemStart | ItemEnd | EventType | EventDate
--------------------------------------------
1 | 2 | Planted | 2014-01-01
1 | 3 | Picked | 2014-01-02
3 | 5 | Eaten | 2014-01-05
The two tables are linked only by the primary key of Item and the range of ItemStart and ItemEnd (inclusive) in Event. Events always refer to contiguous sequences of Items, but not all the Events for a given Item will have the same range. Events never occur on the same date for a given Item.
The query I'd like to produce is as follows:
List all the Items, and for each Item show the most recent Event
Sample output:
ID | Name | Event | Date
----------------------------
1 | Apple | Picked | 2014-01-02 (Planted then Picked)
2 | Pear | Picked | 2014-01-02 (Planted then Picked)
3 | Banana | Eaten | 2014-01-05 (Picked then Eaten)
4 | Plum | Eaten | 2014-01-05 (Eaten)
5 | Tomato | Eaten | 2014-01-05 (Eaten)
This seems reasonable enough on the face of it, and if there were traditional foreign-key relationships in place (imagine ItemID
instead of ItemStart
and ItemEnd
) I'd probably do a join to a correlated subquery something like this:
SELECT Name, EventType, EventDate
FROM Item i
INNER JOIN (
SELECT ItemID, EventType, EventDate
FROM Event e
WHERE EventDate = (SELECT MAX(EventDate) FROM Event e_max WHERE e_max.ItemID = e.ItemID)
) latest_events ON i.ID = latest_events.ItemID
However, with the range relationship in place I'm stuck, I want to do something more like this but it doesn't work:
SELECT Name, EventType, EventDate
FROM Item i
INNER JOIN (
SELECT ItemStart, ItemEnd, EventType, EventDate
FROM Event e
WHERE EventDate = (SELECT MAX(EventDate) FROM Event e_max WHERE i.ID >= e_max.ItemStart AND i.ID <= e_max.ItemEnd)
) latest_events ON i.ID >= latest_events.ItemStart AND i.ID <= latest_events.ItemEnd
I get an error about i.ID >= e_max.ItemStart AND i.ID <= e_max.ItemEnd
on line 6, because you can't reference i
from within another part of the join. I wanted to do that (which isn't necessary in the simpler example) because when I'm constructing the subquery I no longer have a single ID to link to - the overlapping ranges mean that there are many possible ways of including a single Item, and so I want to refer directly back to that item, whose ID is only available in the top-level Item table.
I hope that makes sense.
I'm using SQL Server 2008 R2. This is for a report which will run overnight, so speed is not as important is it could be, but there are very many Items (100s of million); while there are multiple Events against each Item, the use of large ranges means there are much much fewer Event records.
Things I've thought about:
- Somehow expanding out the Item/Event relationship so that Events are recorded against all the individual Items. This would result in a significant increase in the amount of data under consideration, but would allow the simpler query approach.
- Somehow process the Events to constrain or consolidate the ranges - if I knew that for a given Item all its Events had the same start and end I could perhaps simplify things. Not fully thought that through.
How can I produce this query? Thanks in advance!
This should be similar to other greatest-in-group and joining-on-range solutions:
You can do this utilizing a
CTE
androw_number()
.SQL Fiddle Demo
Basically the CTE has joined item and event and added a new column for rownumber and is partitioned on item.ID. Here's a screenshot of what it looks like. From here I just select rNum = 1 which should be the max event date for each item.id.