Background (Input)
The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguously dated sections. The data resembles:
"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14 -- Start of March
"2008-03-02";17
"2008-03-05";17
Problem (Output)
Although possible to extrapolate missing data (e.g., by averaging from other years) to provide contiguous ranges, to simplify the system, I want to flag the non-contiguous segments based on whether there is a contiguous range of dates to fill the month:
D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14 -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17
Some measurements were taken in the year 1843.
Question
For all weather stations, how would you mark all the days in months that are missing one or more days?
Source Code
The code to select the data resembles:
select
m.id,
m.taken,
m.station_id,
m.amount
from
climate.measurement
Related Ideas
Generate a table filled with contiguous dates and compare them to the measured data dates.
- What is the most straightforward way to pad empty dates in sql results (on either mysql or perl end)?
- How do I group on continuous ranges
- http://msdn.microsoft.com/en-us/library/aa175780%28v=sql.80%29.aspx
Update
The problem can be recreated using the SQL in this section.
Table
The table is created as follows:
CREATE TABLE climate.calendar
(
id serial NOT NULL,
n character varying(2) NOT NULL,
d date NOT NULL,
"valid" boolean NOT NULL DEFAULT true,
CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
Generate Data
The following SQL inserts data into a table (id
[int], n
ame [varchar], d
ate [date], valid
[boolean]):
insert into climate.calendar (n, d)
select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d)
select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
The values 'A'
through 'F'
represent the names of weather stations that made a measurement on a particular day.
Remove Random Rows
Delete some rows as follows:
delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);
Attempt #1
The following does not toggle the valid
flag to false
for all the days in a month where the month is missing one or more days:
UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
SELECT DISTINCT date_trunc('month', d)
FROM climate.calendar A
WHERE NOT EXISTS (
SELECT 1
FROM climate.calendar B
WHERE A.d - 1 = B.d
)
);
Attempt #2
The following SQL produces an empty result set:
with gen_calendar as (
select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
Attempt #3
The following SQL generates all the possible combinations of station names and dates:
select
distinct( cc.n ), t.d
from
climate.calendar cc,
(
select (date('1982-01-1') + (n || ' days')::interval)::date d
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
) t
order by
cc.n
However, in the real data there are several hundred stations and the dates go back to the mid 1800s, so the Cartesian of all dates for all stations is too large. Such an approach might work, given enough time... There must be a faster way.
Attempt #4
PostgreSQL has windowing functions.
How to select specific changes using windowing functions in postgres
Thank you!
Assuming that there can't be more than one row per day, this should return all the months for which the number of rows is not equal to the number of days in the month.
Here is one way you could do it, assuming you have a BOOLEAN field called is_contiguous. Modify as necessary:
Edit:
I believe I misunderstood your requirement. I thought you wanted to flag individual dates that were not contiguous. But apparently, you want to flag an entire month's worth of dates as discontiguous if it is missing any number of days.
Edit 2:
Here is a modified version of my original (incorrect) query which selects the distinct months that are missing any days:
generate_series()
PostgreSQL's
generate_series()
function can create a view that contains a consecutive list of dates:The expression
select max(date) - min(date) from test
might be off by one.Count Days Per Month
One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that
climate.calendar
is translated toclimate_calendar
.) The second returns the actual daily readings each station produced per month.Maximum Days Per Month Per Station
This view will return the actual number of days in a month, per station. (For example, February will always have either 28 or 29 days.)
Actual Days Per Month Per Station
The total number of days returned will be fewer than the tallies. (For example, January will always have 31 days or fewer.)
Drop the
ORDER BY
clauses in production (they're helpful in development).Compare Views
Join the two views to identify the stations and months that need to be flagged, into a new view:
The column
num_days_missing
is not necessary, but it is useful.These are the rows that need to be updated:
Update Database
To update them, the
id
key is convenient.