Detect and delete gaps in time series

I have daily time series for different companies in my dataset and work with PostgreSQL. My goal is to exclude companies whose time series are too incomplete. Therefore I want to exclude all companies that have 3 or more consecutive missing values. Furthermore, I want to exclude all companies that have more than 50% missing values between their first and final date in the dataset.

We can work with the following example data:

date             company    value
2012-01-01       A          5
2012-01-01       B          2
2012-01-02       A          NULL
2012-01-02       B          2
2012-01-02       C          4
2012-01-03       A          NULL
2012-01-03       B          NULL
2012-01-03       C          NULL
2012-01-04       A          NULL
2012-01-04       B          NULL
2012-01-04       C          NULL
2012-01-05       A          8
2012-01-05       B          9
2012-01-05       C          3
2012-01-06       A          8
2012-01-06       B          9
2012-01-06       C          NULL

So A has to be excluded because it has a gap of three consecutive missing values, and C because it has more than 50% missing values between its first and final date.
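
To reproduce the example, the table could be set up as sketched below. The column types and the primary key are assumptions, since only the column names are given above; the final query just recomputes the share of missing values per company.

```sql
-- Assumed schema: the question does not state the column types.
CREATE TABLE mytable (
    date    date NOT NULL,
    company text NOT NULL,
    value   integer,              -- NULL marks a missing observation
    PRIMARY KEY (company, date)   -- assumption: one row per company and day
);

INSERT INTO mytable (date, company, value) VALUES
    ('2012-01-01', 'A', 5),    ('2012-01-01', 'B', 2),
    ('2012-01-02', 'A', NULL), ('2012-01-02', 'B', 2),    ('2012-01-02', 'C', 4),
    ('2012-01-03', 'A', NULL), ('2012-01-03', 'B', NULL), ('2012-01-03', 'C', NULL),
    ('2012-01-04', 'A', NULL), ('2012-01-04', 'B', NULL), ('2012-01-04', 'C', NULL),
    ('2012-01-05', 'A', 8),    ('2012-01-05', 'B', 9),    ('2012-01-05', 'C', 3),
    ('2012-01-06', 'A', 8),    ('2012-01-06', 'B', 9),    ('2012-01-06', 'C', NULL);

-- Share of missing values per company; COUNT(value) skips NULLs.
SELECT company,
       COUNT(*)                AS days,
       COUNT(*) - COUNT(value) AS missing,
       ROUND((COUNT(*) - COUNT(value))::numeric / COUNT(*), 2) AS missing_share
FROM mytable
GROUP BY company
ORDER BY company;
```

On this data the summary gives A 3 of 6 missing (0.50), B 2 of 6 (0.33) and C 3 of 5 (0.60), so only C exceeds the 50% threshold, while A is caught by the three-in-a-row rule.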

Combining other answers from this forum, I came up with the following code:

Add an autoincrement primary key to identify each row

CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date; 
CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE; 
ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq'); 
UPDATE test SET id = NEXTVAL('id_seq');

ALTER TABLE test ADD PRIMARY KEY (id);

Detect the gaps in the time series

CREATE TABLE to_del AS WITH count3 AS 
( SELECT *, 
  COUNT(CASE WHEN value IS NULL THEN 1 END) 
     OVER (PARTITION BY company ORDER BY id 
           ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) 
  AS cnt FROM test) 
SELECT company, id FROM count3 WHERE cnt >= 3;

Delete the affected companies from mytable

DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);

This seems to detect and delete the companies with gaps of 3 or more consecutive missing values. But the approach is very cumbersome, and I can't figure out how to additionally exclude all companies with more than 50% missing values.

Can you think of a more efficient solution than mine (I am just learning to work with PostgreSQL) that also excludes companies with more than 50% missing values?

Tags: sql postgresql gaps-and-islands

2 answers

Ridiculous、

#2 · 2019-06-14 11:04

For the 50% criterion, you could select all the companies for which the number of distinct dates is lower than half the number of days between the min and max dates.

I have not tested this, but it should give you an idea. I used a CTE to make it easier to read.

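WITH MinMax AS
(
    -- calendar_days: length of the company's date range, both endpoints included
    -- days: distinct dates that carry a non-NULL value (NULL rows count as missing)
    SELECT company,
           MAX(date)::date - MIN(date)::date + 1 AS calendar_days,
           COUNT(DISTINCT CASE WHEN value IS NOT NULL THEN date END) AS days
      FROM mytable
     GROUP BY company
)
SELECT company FROM MinMax
 WHERE days * 2 < calendar_days  -- avoids integer division; true when over 50% is missing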

Lonely孤独者°

#3 · 2019-06-14 11:06

I would use a single query:

DELETE FROM mytable 
WHERE company in (
  SELECT Company 
  FROM (
    SELECT Company, 
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company ORDER BY id 
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company)
      / 
      COUNT(*) 
         OVER (PARTITION BY company) As p50
  ) alias
  WHERE cnt >= 3 OR p50 > 0.5
)

A composite index on the (company, value) columns may help speed up this query.
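
As a sketch of that suggestion, the index could be created as shown here. The index name is made up for illustration, and whether the planner actually uses the index depends on table size and statistics, so it is worth looking at the plan of whatever query you end up running.

```sql
-- Hypothetical index name; the columns are the ones suggested above.
CREATE INDEX mytable_company_value_idx ON mytable (company, value);

-- Refresh planner statistics so the new index is considered.
ANALYZE mytable;

-- Inspect the plan of a per-company aggregate to see whether the index is used.
EXPLAIN
SELECT company, COUNT(*) - COUNT(value) AS missing
FROM mytable
GROUP BY company;
```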

EDIT

The query above does not work as written.
I have corrected it slightly; here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things have been changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id
- an explicit cast to numeric (because the integer division was truncating the ratio to 0):
OVER (PARTITION BY company)::numeric

  SELECT company, cnt, p50
  FROM (
    SELECT company, 
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company ORDER BY date 
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
      SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) 
         OVER (PARTITION BY company)::numeric
      / 
      COUNT(*) 
         OVER (PARTITION BY company) As p50
    FROM mytable
  ) alias
--  WHERE cnt >= 3 OR p50 > 0.5
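
The numeric cast matters because PostgreSQL truncates the result when both operands of / are integers, and COUNT and SUM over integers return integer types. A minimal illustration with made-up literals:

```sql
SELECT 3 / 5          AS integer_division,  -- both operands are integers: yields 0
       3::numeric / 5 AS numeric_division;  -- one operand is numeric: yields 0.6
```

Without the cast, the ratio is truncated to 0 for any company that has at least one non-missing value, so the p50 > 0.5 condition would only ever match a company whose values are all NULL.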

With these changes, the delete query should work:

DELETE FROM mytable 
WHERE company in (
      SELECT company
      FROM (
        SELECT company, 
          COUNT(CASE WHEN value IS NULL THEN 1 END) 
             OVER (PARTITION BY company ORDER BY date 
                   ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
          SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) 
             OVER (PARTITION BY company)::numeric
          / 
          COUNT(*) 
             OVER (PARTITION BY company) As p50
        FROM mytable
      ) alias
    WHERE cnt >= 3 OR p50 > 0.5
)
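
Since the DELETE is destructive, it may be worth trying it inside a transaction first. The sketch below is a variant, not the query above verbatim: it swaps in FILTER (PostgreSQL 9.4 or later) and AVG over a 0/1 flag for the CASE expressions, moves the subquery into a CTE named flagged (a name chosen here), and uses RETURNING to show what would be removed.

```sql
BEGIN;

WITH flagged AS (
    SELECT company,
           -- NULLs among the current row and the next two rows of the same company
           COUNT(*) FILTER (WHERE value IS NULL)
               OVER (PARTITION BY company ORDER BY date
                     ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt,
           -- share of NULL rows per company; AVG over integers returns numeric
           AVG((value IS NULL)::int)
               OVER (PARTITION BY company) AS p50
    FROM mytable
)
DELETE FROM mytable
WHERE company IN (SELECT company FROM flagged WHERE cnt >= 3 OR p50 > 0.5)
RETURNING company, date, value;

-- Inspect the returned rows, then undo; use COMMIT instead to keep the deletion.
ROLLBACK;
```

On the example data from the question, this returns the rows of companies A and C and leaves B untouched.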
