Islands and Gaps Issue

2019-05-07 14:29发布

问题:

Backstory: I have a database that has data points of drivers in trucks which also contain the. While in a truck, the driver can have a 'driverstatus'. What I'd like to do is group these statuses by driver, truck.

As of now, I've tried using LAG/LEAD to help. The reason for this is so I can tell when a driverstatus change occurs, and then I can mark that row as having the last datetime of that status.

That in itself is insufficient, because I need to group the statuses by their status and date. For this, I've got something such as DENSE_RANK, but I can't manage to get that right concerning the ORDER BY clause.

Here is my test data, and here is one attempt of many of me floundering with ranking.

/****** Script for SelectTopNRows command from SSMS  ******/
DECLARE @SomeTable TABLE
(
    loginId VARCHAR(255),
    tractorId VARCHAR(255),
    messageTime DATETIME,
    driverStatus VARCHAR(2)
);

INSERT INTO @SomeTable (loginId, tractorId, messageTime, driverStatus)
VALUES('driver35','23533','2018-08-10 8:33 AM','2'),
('driver35','23533','2018-08-10 8:37 AM','2'),
('driver35','23533','2018-08-10 8:56 AM','2'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 9:07 AM','1'),
('driver35','23533','2018-08-10 9:04 AM','1'),
('driver35','23533','2018-08-12 8:07 AM','3'),
('driver35','23533','2018-08-12 8:37 AM','3'),
('driver35','23533','2018-08-12 9:07 AM','3'),
('driver35','23533','2018-06-12 8:07 AM','2'),
('driver35','23533','2018-06-12 8:37 AM','2'),
('driver35','23533','2018-06-12 9:07 AM','2')
;
SELECT *, DENSE_RANK() OVER(PARTITION BY 
  loginId, tractorId, driverStatus 
ORDER BY messageTime ) FROM @SomeTable
;

My end result would ideally look something like this:

loginId tractorId   startTime           endTime            driverStatus
driver35    23533   2018-08-10 8:33 AM  2018-08-10 8:56 AM      2
driver35    23533   2018-08-10 8:57 AM  2018-08-10 9:07 AM      1
driver35    23533   2018-08-12 8:07 AM  2018-08-12 9:07 AM      3

Any help on this is greatly appreciated.

回答1:

WITH drivers_data AS
(
    SELECT *,
           row_num =     ROW_NUMBER()
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime),

           row_num_all = ROW_NUMBER()
                         OVER (PARTITION BY loginId,
                                            tractorId
                               ORDER BY messageTime),

           first_date =  FIRST_VALUE (messageTime)
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime),

           last_date =   LAST_VALUE (messageTime)
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime
                               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    FROM @t
)
SELECT loginId, tractorId, first_date, last_date, driverStatus
FROM drivers_data
WHERE row_num = 1
ORDER BY row_num_all;

OUTPUT:

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:57:00 | 2018-10-08 09:07:00 | 1            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-06 08:07:00 | 2018-12-06 09:07:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-08 08:07:00 | 2018-12-08 09:07:00 | 3            |
+----------+-----------+---------------------+---------------------+--------------+

I will try to explain what's going on here:

  1. row_num This is for numbering rows which are restricted by date and status of the driver. We need casting since we need date part without time.
  2. row_num_all This is the key attribute since it allows us in the end to sort rows by occurrence. This window is not restricted by status since we need numbering across whole driver's data.
  3. first_date The FIRST_VALUE is handy function for our purpose. It just retrieves the first datetime occurrence.
  4. last_date It's correct to assume that for the last date we need LAST_VALUE window function. But using it is tricky and requires more explanation. As you can see I explicitly use special framing ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. But why? Let me explain. Let's take a part of output for date 10/8/2018 and status 2 with default framing. We get the following results:
+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            | 
+----------+-----------+---------------------+---------------------+--------------+

As you can see, the last date is incorrect! This happens because LAST_VALUE uses default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW- it means that last row is always current row in window. Here's what happens under the hood. Three windows get created. Each row gets its own window. Then it retrieves the last row from window:

Window for 1-st row

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+

Window for 2-nd row

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+

Window for 3-rd row

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            | 
+----------+-----------+---------------------+---------------------+--------------+

So, the solution for this is to change framing: we need to move not from beginning to current row, but from current row to the end. So, UNBOUNDED FOLLOWING just means this - last row in current window.

  1. Next is WHERE row_num = 1. This is all simple: since all rows has same information about first date and last date, we just need first row.

  2. The final part is ORDER BY row_num_all. This is where you get your correct ordering.

P.S.

  1. Your desired output in question is incorrect. For date 8/10/18 8:57 AM and status 1 the last date must be 10/8/2018 9:07 AM - not 10/8/2018 9:04 AM, as you mentioned.

  2. Also there's missing output for date 12/6/2018 and status 2.

UPDATE:

Here are illustrations of how FIRST_VALUE and LAST_VALUE work.

All three figures have following parts:

  1. Query data This is result of query.
  2. Original query Original source data.
  3. Windows These are intermediate steps for calculations.
  4. Frame Mentions which frame is used.
  5. Green cell Window specification.

Here's what's happening under the hood:

  1. First, SQL Server creates partitions for all mentioned fields. On figure it is partition column.
  2. Each partition can have a frame: either default or custom. The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that the row gets window between the start of partition and current row. If you don't mention a frame, the default frame comes into play.
  3. Each frame creates window for each row. On figures these windows are in columns row 1 to row 2 and marked with color. The row number corresponds to row_num_all field.
  4. A row operates only in bounds of its window.

1. FIRST_VALUE

To get first date, we can use handy FIRST_VALUE window function. As you can see, we use default frame here. This means that for each row the window will be between the start of window and current row. For getting first date this is just what we need. Each row will fetch the value from first row. The first date is in "first_date" field.

2. LAST_VALUE - incorrect frame

Now we need to calculate last date. The last date is in the last row of partition, so we can use LAST_VALUE window function. As I mentioned earlier, if we don't mention frame, the default frame is used. As you can see on figure, the frame always ends in the current row - this is incorrect, because we need the date from last window row. The last_date field shows us incorrect results - it reflects the date from current row.

3. LAST_VALUE - correct frame

To fix the situation with fetching last date, we need to change the frame upon which LAST_VALUE will operate on: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. As you can see, the window for each row now is between current row and the end of partition. In this case LAST_VALUE will correctly fetch the date from last row of window. Now the result in last_date field is correct.



回答2:

The below solution identifies each time an island starts (when a driverStatus changes) within each loginID / tractorID combination, and then assigns an "id" number to that island.

After that, its a simple min/max to find when that island starts and ends.

Answer:

select b.loginId
, b.tractorId
, min(b.messageTime) as startTime
, max(b.messageTime) as endTime
, b.driverStatus
from (
    select a.loginId
    , a.tractorId
    , a.messageTime
    , a.driverStatus
    , a.is_island_start_flg
    , sum(a.is_island_start_flg) over (partition by a.loginID, a.tractorID order by a.messageTime asc) as island_nbr --assigning the "id" number to the island
    from (
        select st.loginId
        , st.tractorId
        , st.messageTime
        , st.driverStatus
        , iif(lag(st.driverStatus, 1, st.driverStatus) over (partition by st.loginID, st.tractorId order by st.messageTime asc) = st.driverStatus, 0, 1) as is_island_start_flg --identifying start of island
        from @SomeTable as st
        ) as a
    ) as b
group by b.loginId
, b.tractorId
, b.driverStatus
, b.island_nbr --purposefully in the group by, to make sure each occurrence of a status is in final results
order by b.loginId asc
, b.tractorId asc
, min(b.messageTime) asc

When you leave off the last three records of the sample data (as this is not in the expected output of the question, just like JohnyL said), this query produces the exact output from the question.



回答3:

SELECT 
  t.loginId, 
  t.tractorId, 
  startTime = MIN(messageTime), 
  endTime   = MAX(messageTime),
  driverStatus 
FROM @someTable t
GROUP BY loginId, tractorId, driverStatus
ORDER BY MIN(messageTime);

Results:

loginId        tractorId  startTime               endTime                 driverStatus
-------------- ---------- ----------------------- ----------------------- ------------
driver35       23533      2018-10-08 08:33:00.000 2018-10-08 08:56:00.000 2
driver35       23533      2018-10-08 08:57:00.000 2018-10-08 09:07:00.000 1
driver35       23533      2018-12-08 08:07:00.000 2018-12-08 09:07:00.000 3