Jump SQL gap over specific condition & proper lead

2019-02-20 10:02发布

问题:

(PostgreSQL 8.4) Continuing with my previous example, I wish to further my understanding of gaps-and-islands processing with Window-functions. Consider the following table and data:

CREATE TABLE T1
(
  id SERIAL PRIMARY KEY,
  val INT,   -- some device
  status INT -- 0=OFF, 1=ON
);

INSERT INTO T1 (val, status) VALUES (10, 0);
INSERT INTO T1 (val, status) VALUES (11, 0);
INSERT INTO T1 (val, status) VALUES (11, 1);
INSERT INTO T1 (val, status) VALUES (10, 1);
INSERT INTO T1 (val, status) VALUES (11, 0);
INSERT INTO T1 (val, status) VALUES (10, 0);

As previously explained, the devices turn ON and OFF and this time I wish to extract a specific sequence:

  • show all new ON status records that aren't duplicate (same device twice in a row)
  • show appropriate OFF status from currently ON device

The closest I could get is this:

SELECT * FROM  (
  SELECT *
       ,lag(val, 1, 0) OVER (PARTITION BY status ORDER BY id) last_val
       ,lag(status, 1, -1) OVER (PARTITION BY val ORDER BY id) last_status
  FROM   t1
  ) x
WHERE  (last_val <> val OR last_status <> status)
AND    (status = 1 OR last_status <> -1)
ORDER BY id

This filters out more bogus data that the sample doesn't include but essentially it's about taking out subsequent duplicates (regardless of status) and the top OFF records which don't match. Records 3, 4, 5 and 6 are returned, but I don't want the fifth, it's an OFF which came after a new ON. So I need to jump that gap and look for the next proper OFF for the currently active device.

  1. 10 turns OFF -- bogus in this context, but messes up the lag()
  2. 11 turns OFF -- bogus in this context, but messes up the lag()
  3. 11 turns ON -- OK, new sequence, include in SELECT
  4. 10 turns ON -- OK, new sequence, include in SELECT
  5. 11 turns OFF -- message came in late, need to ignore gap
  6. 10 turns OFF -- OK, proper OFF to row 4, need to include in SELECT

Once filtered properly, I would like to use lead() on top of it all to get the next row's id (imagine a timestamp) as well as filter out all records that aren't ON statuses. I imagine this will require three embedded SELECT statements. This would get me a clear understanding of how long a device was active, until the condition of either another ON or a proper turn OFF.

回答1:

Query with window functions

SELECT *
FROM  (
   SELECT *
         ,lag(val, 1, 0)    OVER (PARTITION BY status ORDER BY id) AS last_val
         ,lag(status, 1, 0) OVER w2 AS last_status
         ,lag(next_id)      OVER w2 AS next_id_of_last_status
   FROM  (
      SELECT *, lead(id) OVER (PARTITION BY status ORDER BY id) AS next_id
      FROM   t1
      ) AS t
   WINDOW w2 AS (PARTITION BY val ORDER BY id)
  ) x
WHERE (last_val <> val OR last_status <> status)
AND   (status = 1 
       OR last_status = 1
          AND ((next_id_of_last_status > id) OR next_id_of_last_status IS NULL)
      )
ORDER  BY id

In addition to what we already had, we need valid OFF switches.

An OFF switch if valid if the device was switched ON before (last_status = 1) and the next ON operation after that comes after the OFF switch in question (next_id_of_last_status > id).

We have to provide for the special case that there is was the last ON operation, so we check for NULL in addition (OR next_id_of_last_status IS NULL).

The next_id_of_last_status comes from the same window that we take last_status from. Therefore I introduced additional syntax for explicit window declaration, so I don't have to repeat myself:

WINDOW w2 AS (PARTITION BY val ORDER BY id)

And we need to get the next id for the last status in a subquery earlier (subquery t).

If you've understood all that, you shouldn't have a problem slapping lead() on top of this query to get to your final destination. :)

PL/pgSQL function

Once it gets this complex, it's time to switch to procedural processing.

This comparatively simple plpgsql function nukes the performance of the complex window function query, for the simple reason that it has to scan the whole table only once.

CREATE OR REPLACE FUNCTION valid_t1 (OUT t t1)  -- row variable of table type
  RETURNS SETOF t1 LANGUAGE plpgsql AS
$func$
DECLARE
   _last_on int := -1;  -- init with impossible value
BEGIN

FOR t IN
   SELECT * FROM t1 ORDER BY id
LOOP
   IF t.status = 1 THEN
      IF _last_on <> t.val THEN
         RETURN NEXT;
         _last_on := t.val;
      END IF;
   ELSE
      IF _last_on = t.val THEN
         RETURN NEXT;
         _last_on := -1;
      END IF;
   END IF;
END LOOP;

END
$func$;

Call:

SELECT * FROM valid_t1();