Background (Input)
The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguously dated sections. The data resembles:
"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14 -- Start of March
"2008-03-02";17
"2008-03-05";17
Problem (Output)
Although possible to extrapolate missing data (e.g., by averaging from other years) to provide contiguous ranges, to simplify the system, I want to flag the non-contiguous segments based on whether there is a contiguous range of dates to fill the month:
D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14 -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17
Some measurements were taken in the year 1843.
Question
For all weather stations, how would you mark all the days in months that are missing one or more days?
Source Code
The code to select the data resembles:
select
m.id,
m.taken,
m.station_id,
m.amount
from
climate.measurement
Related Ideas
Generate a table filled with contiguous dates and compare them to the measured data dates.
- What is the most straightforward way to pad empty dates in sql results (on either mysql or perl end)?
- How do I group on continuous ranges
- http://msdn.microsoft.com/en-us/library/aa175780%28v=sql.80%29.aspx
Update
The problem can be recreated using the SQL in this section.
Table
The table is created as follows:
CREATE TABLE climate.calendar
(
id serial NOT NULL,
n character varying(2) NOT NULL,
d date NOT NULL,
"valid" boolean NOT NULL DEFAULT true,
CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
Generate Data
The following SQL inserts data into a table (id
[int], n
ame [varchar], d
ate [date], valid
[boolean]):
insert into climate.calendar (n, d)
select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
The values 'A'
through 'F'
represent the names of weather stations that made a measurement on a particular day.
Remove Random Rows
Delete some rows as follows:
delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);
Attempt #1
The following does not toggle the valid
flag to false
for all the days in a month where the month is missing one or more days:
UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
SELECT DISTINCT date_trunc('month', d)
FROM climate.calendar A
WHERE NOT EXISTS (
SELECT 1
FROM climate.calendar B
WHERE A.d - 1 = B.d
)
);
Attempt #2
The following SQL produces an empty result set:
with gen_calendar as (
select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
Attempt #3
The following SQL generates all the possible combinations of station names and dates:
select
distinct( cc.n ), t.d
from
climate.calendar cc,
(
select (date('1982-01-1') + (n || ' days')::interval)::date d
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
) t
order by
cc.n
However, in the real data there are several hundred stations and the dates go back to the mid 1800s, so the Cartesian of all dates for all stations is too large. Such an approach might work, given enough time... There must be a faster way.
Attempt #4
PostgreSQL has windowing functions.
How to select specific changes using windowing functions in postgres
Thank you!