I have daily time series for different companies in my dataset and work with PostgreSQL. My goal is to exclude companies with too incomplete time series. Therefor I want to exclude all companies which have 3 or more consecutive missing values. Furthermore I want to exclude all companies which have more than 50% missing values between their first and final date in the dataset.
We can work with the following example data:
date company value
2012-01-01 A 5
2012-01-01 B 2
2012-01-02 A NULL
2012-01-02 B 2
2012-01-02 C 4
2012-01-03 A NULL
2012-01-03 B NULL
2012-01-03 C NULL
2012-01-04 A NULL
2012-01-04 B NULL
2012-01-04 C NULL
2012-01-05 A 8
2012-01-05 B 9
2012-01-05 C 3
2012-01-06 A 8
2012-01-06 B 9
2012-01-06 C NULL
So A has to be excluded because it has a gap of three consecutive missing values, and C because it has more than 50% missing values between its first and final date.
Combining other answers in this forum I made up the following code:
Add an autoincrement primary key to identify each row
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date; CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE; ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq'); UPDATE test SET id = NEXTVAL('id_seq'); ALTER TABLE test ADD PRIMARY KEY (id);
Detect the gaps in the time series
CREATE TABLE to_del AS WITH count3 AS ( SELECT *, COUNT(CASE WHEN value IS NULL THEN 1 END) OVER (PARTITION BY company ORDER BY id ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt FROM test) SELECT company, id FROM count3 WHERE cnt >= 3;
Delete the gaps from mytable
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
It seems to achieve to detect and delete gaps of 3 or more consecutive missing values from the time series. But this approach is very cumbersome. And I can't figure out how to additinoally exclude all companies with more than 50% missing values.
Can you think of a more effective solution than mine (I just learn to work with PostgreSQL), that also manages to exclude companies with more than 50% missing values?
For the 50% criteria, you could select all the companies for which the number of distinct dates in lower than half the number of days between the min and max dates.
I have not tested this but it should give you an idea. I used a CTE to make it easier to read.
I would create only one query:
A composite index on (company + value) columns can help to gain a maximum speed of this query.
EDIT
The above query doesn't work
I've corrected it slightly, here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things have been changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id
- explicit cast to numeric( because integer have been truncated to 0):
OVER (PARTITION BY company)::numeric
and now the delete query should work: