SQL join to correlated subquery where tables are r

2019-01-28 21:10发布

I have the following table structure:

Item

ID | Name
--------
1  | Apple
2  | Pear 
3  | Banana
4  | Plum
5  | Tomato

Event

ItemStart | ItemEnd | EventType | EventDate
--------------------------------------------
     1    |    2    |  Planted  | 2014-01-01
     1    |    3    |  Picked   | 2014-01-02
     3    |    5    |  Eaten    | 2014-01-05

The two tables are linked only by the primary key of Item and the range of ItemStart and ItemEnd (inclusive) in Event. Events always refer to contiguous sequences of Items, but not all the Events for a given Item will have the same range. Events never occur on the same date for a given Item.

The query I'd like to produce is as follows:

List all the Items, and for each Item show the most recent Event

Sample output:

ID | Name   |   Event | Date
----------------------------
1  | Apple  |  Picked | 2014-01-02 (Planted then Picked)
2  | Pear   |  Picked | 2014-01-02 (Planted then Picked)
3  | Banana |  Eaten  | 2014-01-05 (Picked then Eaten)
4  | Plum   |  Eaten  | 2014-01-05 (Eaten)
5  | Tomato |  Eaten  | 2014-01-05 (Eaten)

This seems reasonable enough on the face of it, and if there were traditional foreign-key relationships in place (imagine ItemID instead of ItemStart and ItemEnd) I'd probably do a join to a correlated subquery something like this:

SELECT Name, EventType, EventDate
FROM Item i
    INNER JOIN (
        SELECT ItemID, EventType, EventDate
        FROM Event e
        WHERE EventDate = (SELECT MAX(EventDate) FROM Event e_max WHERE e_max.ItemID = e.ItemID)
    ) latest_events ON i.ID = latest_events.ItemID

However, with the range relationship in place I'm stuck, I want to do something more like this but it doesn't work:

SELECT Name, EventType, EventDate
FROM Item i
    INNER JOIN (
        SELECT ItemStart, ItemEnd, EventType, EventDate
        FROM Event e
        WHERE EventDate = (SELECT MAX(EventDate) FROM Event e_max WHERE i.ID >= e_max.ItemStart AND i.ID <= e_max.ItemEnd)
    ) latest_events ON i.ID >= latest_events.ItemStart AND i.ID <= latest_events.ItemEnd

I get an error about i.ID >= e_max.ItemStart AND i.ID <= e_max.ItemEnd on line 6, because you can't reference i from within another part of the join. I wanted to do that (which isn't necessary in the simpler example) because when I'm constructing the subquery I no longer have a single ID to link to - the overlapping ranges mean that there are many possible ways of including a single Item, and so I want to refer directly back to that item, whose ID is only available in the top-level Item table.

I hope that makes sense.

I'm using SQL Server 2008 R2. This is for a report which will run overnight, so speed is not as important is it could be, but there are very many Items (100s of million); while there are multiple Events against each Item, the use of large ranges means there are much much fewer Event records.

Things I've thought about:

  • Somehow expanding out the Item/Event relationship so that Events are recorded against all the individual Items. This would result in a significant increase in the amount of data under consideration, but would allow the simpler query approach.
  • Somehow process the Events to constrain or consolidate the ranges - if I knew that for a given Item all its Events had the same start and end I could perhaps simplify things. Not fully thought that through.

How can I produce this query? Thanks in advance!

2条回答
仙女界的扛把子
2楼-- · 2019-01-28 21:41

This should be similar to other greatest-in-group and joining-on-range solutions:

SELECT * FROM 
  Item i INNER JOIN 
  Event e ON i.id BETWEEN e.ItemStart AND e.ItemEnd
WHERE NOT EXISTS ( -- exclude non-last events
   SELECT * FROM Event 
   WHERE 
      i.id between ItemStart and ItemEnd
      AND e.EventDate < EventDate)
查看更多
女痞
3楼-- · 2019-01-28 21:43

You can do this utilizing a CTE and row_number().

SQL Fiddle Demo

;with cte as 
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY i.id ORDER BY e.EventDate DESC) as rNum
    FROM Item i
    JOIN Event e
        ON i.id between e.ItemStart and e.ItemEnd
)

SELECT ID,
  Name, 
  EventType,
  EventDate FROM cte
WHERE rNum = 1

Basically the CTE has joined item and event and added a new column for rownumber and is partitioned on item.ID. Here's a screenshot of what it looks like. From here I just select rNum = 1 which should be the max event date for each item.id.

enter image description here

查看更多
登录 后发表回答