Mark non-contiguous date ranges

2019-06-16 20:29发布

Background (Input)

The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguously dated sections. The data resembles:

"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14  -- Start of March
"2008-03-02";17
"2008-03-05";17

Problem (Output)

Although possible to extrapolate missing data (e.g., by averaging from other years) to provide contiguous ranges, to simplify the system, I want to flag the non-contiguous segments based on whether there is a contiguous range of dates to fill the month:

D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14  -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17

Some measurements were taken in the year 1843.

Question

For all weather stations, how would you mark all the days in months that are missing one or more days?

Source Code

The code to select the data resembles:

select
  m.id,
  m.taken,
  m.station_id,
  m.amount
from
  climate.measurement

Related Ideas

Generate a table filled with contiguous dates and compare them to the measured data dates.

Update

The problem can be recreated using the SQL in this section.

Table

The table is created as follows:

CREATE TABLE climate.calendar
(
  id serial NOT NULL,
  n character varying(2) NOT NULL,
  d date NOT NULL,
  "valid" boolean NOT NULL DEFAULT true,
  CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

Generate Data

The following SQL inserts data into a table (id [int], name [varchar], date [date], valid [boolean]):

insert into climate.calendar (n, d) 
    select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n

The values 'A' through 'F' represent the names of weather stations that made a measurement on a particular day.

Remove Random Rows

Delete some rows as follows:

delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);

Attempt #1

The following does not toggle the valid flag to false for all the days in a month where the month is missing one or more days:

UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
    SELECT DISTINCT date_trunc('month', d)
    FROM climate.calendar A
    WHERE NOT EXISTS (
        SELECT 1
        FROM climate.calendar B
        WHERE A.d - 1 = B.d
   )
);

Attempt #2

The following SQL produces an empty result set:

with gen_calendar as (
    select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;

Attempt #3

The following SQL generates all the possible combinations of station names and dates:

select
  distinct( cc.n ), t.d
from
  climate.calendar cc,
  (
    select (date('1982-01-1') + (n || ' days')::interval)::date d
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
  ) t
order by
  cc.n

However, in the real data there are several hundred stations and the dates go back to the mid 1800s, so the Cartesian of all dates for all stations is too large. Such an approach might work, given enough time... There must be a faster way.

Attempt #4

PostgreSQL has windowing functions.

How to select specific changes using windowing functions in postgres

Thank you!

3条回答
Animai°情兽
2楼-- · 2019-06-16 21:01

Assuming that there can't be more than one row per day, this should return all the months for which the number of rows is not equal to the number of days in the month.

SELECT station_id, DATE_TRUNC('month', d)
FROM climate.calendar
GROUP BY station_id, DATE_TRUNC('month', d)
HAVING COUNT(*) <> 
  DATE_PART('month',
            DATE_TRUNC('month', d) + INTERVAL '1 month' - INTERVAL '1 day')
查看更多
Deceive 欺骗
3楼-- · 2019-06-16 21:03

Here is one way you could do it, assuming you have a BOOLEAN field called is_contiguous. Modify as necessary:

UPDATE measurement
SET is_contiguous = FALSE
WHERE NOT EXISTS (
  SELECT 1
    FROM measurement B
   WHERE measurement.taken - 1 = B.taken
);

Edit:

I believe I misunderstood your requirement. I thought you wanted to flag individual dates that were not contiguous. But apparently, you want to flag an entire month's worth of dates as discontiguous if it is missing any number of days.

Edit 2:

Here is a modified version of my original (incorrect) query which selects the distinct months that are missing any days:

UPDATE measurement
SET is_contiguous = FALSE
WHERE date_trunc('month', taken) IN (
    SELECT DISTINCT date_trunc('month', taken)
    FROM measurement A
    WHERE NOT EXISTS (
        SELECT 1
        FROM measurement B
        WHERE A.taken - 1 = B.taken
   )
);
查看更多
该账号已被封号
4楼-- · 2019-06-16 21:18

generate_series()

PostgreSQL's generate_series() function can create a view that contains a consecutive list of dates:

with calendar as (
    select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0, (select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;

The expression select max(date) - min(date) from test might be off by one.

Count Days Per Month

One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that climate.calendar is translated to climate_calendar.) The second returns the actual daily readings each station produced per month.

Maximum Days Per Month Per Station

This view will return the actual number of days in a month, per station. (For example, February will always have either 28 or 29 days.)

create view count_max_station_calendar_days as 
with calendar as (
    select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0, (select max(d) - min(d) from climate_calendar)) n
)
select n, extract(year from cal_date) yr, extract(month from cal_date) mo, count(*) num_days
from stations cross join calendar
group by n, yr, mo
order by n, yr, mo

Actual Days Per Month Per Station

The total number of days returned will be fewer than the tallies. (For example, January will always have 31 days or fewer.)

create view count_actual_station_calendar_days as
select n, extract(year from d) yr, extract(month from d) mo, count(*) num_days
from climate_calendar
group by n, yr, mo
order by n, yr, mo;

Drop the ORDER BY clauses in production (they're helpful in development).

Compare Views

Join the two views to identify the stations and months that need to be flagged, into a new view:

create view invalid_station_months as 
select m.n, m.yr, m.mo, m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
       on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days)

n   yr    mo  num_days_missing
--
A   1982  1   1
E   2007  3   1

The column num_days_missing is not necessary, but it is useful.

These are the rows that need to be updated:

select cc.* 
from climate_calendar cc
inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
where valid = true

Update Database

To update them, the id key is convenient.

update climate_calendar
set valid = false
where id in (
    select id
    from climate_calendar cc
    inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
    where valid = true
);
查看更多
登录 后发表回答