-->

How to aggregate data from multiple years on MM-DD

2019-07-25 09:47发布

问题:

Postgres version 9.4.18, PostGIS Version 2.2.

Here are the tables I'm working with (and can unlikely make significant changes to the table structure):

Table ltg_data (spans 1988 to 2018):

 Column   |           Type           | Modifiers 
----------+--------------------------+-----------
intensity | integer                  | not null
time      | timestamp with time zone | not null
lon       | numeric(9,6)             | not null
lat       | numeric(8,6)             | not null
ltg_geom  | geometry(Point,4269)     | 
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")

Size of ltg_data (~800M rows):

ltg=# select pg_relation_size('ltg_data');
pg_relation_size 
------------------
 149729288192

Table counties:

  Column   |            Type             |                       Modifiers                      
-----------+-----------------------------+--------------------------------- -----------------------
gid        | integer                     | not null default nextval('counties_gid_seq'::regclass)
objectid_1 | integer                     | 
objectid   | integer                     | 
state      | character varying(2)        | 
cwa        | character varying(9)        | 
countyname | character varying(24)       | 
fips       | character varying(5)        | 
time_zone  | character varying(2)        | 
fe_area    | character varying(2)        | 
lon        | double precision            | 
lat        | double precision            | 
the_geom   | geometry(MultiPolygon,4269) | 
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)

Desired result: I want a time series with one row for every day of the year in format 'MM-DD' ignoring the year: 01-01, 01-02, 01-03, ..., 12-31. And the count of rows in table ltg_data for each day of the year. I also eventually want the same thing for every hour of every day of the year ('MM-DD-HH').

A group by statement should accomplish this, but I'm having a hard time joining the "big" table with the days generated with generate_series().

MM-DD  | total_count   
-------+------------
12-22  |       9
12-23  |       0
12-24  |       0
12-25  |       0
12-26  |      23
12-27  |       0
12-28  |       5
12-29  |       0
12-30  |       0
12-31  |       0

Some of my many attempted queries:

SELECT date_trunc('day', d),
   count(a.lat) AS strikes
FROM generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d
LEFT JOIN
(SELECT date_trunc('day', TIME) AS day_of_year,
      ltg_data.lat
 FROM ltg_data
 JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
 WHERE cwa = 'MFR' ) AS a ON d = day_of_year
GROUP BY d
ORDER BY d ASC;

But this doesn't ignore year. I shouldn't be surprised because the "day" in date_trunc is still considering the year I guess.

2017-12-27 00:00:00-08 |       0
2017-12-28 00:00:00-08 |       0
2017-12-29 00:00:00-08 |       0
2017-12-30 00:00:00-08 |       0
2017-12-31 00:00:00-08 |       0
2018-01-01 00:00:00-08 |       0
2018-01-02 00:00:00-08 |       12
2018-01-03 00:00:00-08 |       0

And this query, in which I'm trying to convert the data from generate_series() to text in 'DD-MM' format to join to the ltg_data table in text format. Says the data types don't match. I've tried extract as well, since that could provide "doy" and "hour", which would work, but I can't seem to match data types in that query either. It's hard to make that "generate_series" a double precision.

SELECT to_char(d, 'MM-DD') AS DAY,
   count(a.lat) AS strikes
FROM
(SELECT generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d) 
AS f
LEFT JOIN
(SELECT to_char(TIME, 'MM-DD') AS day_of_year,
      ltg_data.lat
FROM ltg_data
JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR' ) AS a ON f = day_of_year
GROUP BY d
ORDER BY d ASC;

Result:

ERROR:  operator does not exist: record = text
LINE 4: ON f = day_of_year group by d order by d asc;
         ^
HINT:  No operator matches the given name and argument type(s). You might 
need to add explicit type casts.

Conclusion: I'm aiming at getting daily and hourly total counts that span many years but group by 'MM-DD' and 'MM-DD-HH' (ignoring year), with the query results showing all days/hours even if they are zero.

Later I'll also try to find averages and percentiles for days and hours, so if you have any advice on that, I'm all ears. But my current problem is focused on just getting a complete result for totals.

回答1:

Basically, to cut off the year, to_char(time, 'MMDD') like you already tried does the job. You just forgot to also apply it to the timestamps generated with generate_series()before joining. And some other minor details.

To simplify and for performance and convenience I suggest this simple function to calculate an integer from the pattern 'MMDD' of a given timestamp.

CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';

I used to_char(time, 'MMDD') at first, but switched to the above expression that turned out to be fastest in various tests.

db<>fiddle here

It can be used in expression indexes since it's defined IMMUTABLE. And it still allows function inlining because it only uses EXTRACT (xyz FROM date) - which is implemented with the IMMUTABLE function date_part(text, date) internally. (Note that datepart(text, timestamptz) is only STABLE).

Then this kind of query does the job:

SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM  (
   SELECT f_mmdd(d::date) AS mmdd  -- ignoring the year
   FROM   generate_series(timestamp '2018-01-01'  -- any dummy year
                        , timestamp '2018-12-31'
                        , interval '1 day') d
   ) d
LEFT  JOIN (
   SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
   FROM   counties c
   JOIN   ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
   WHERE  cwa = 'MFR'
   GROUP  BY 1
   ) ct USING (mmdd)
ORDER  BY 1;

Since time (I would use a different column name) is data type timestamptz the cast time::date depends on the time zone setting of your current session. ("Days" are defined by the time zone you are in.) To get immutable (but slower) results use the AT TIME ZONE construct with a time zone name like:

SELECT f_mmdd((time AT TIME ZONE 'Europe/Vienna')::date) ...

Details:

  • Ignoring time zones altogether in Rails and PostgreSQL

Format mmdd any way you like for display.

The cast to integer is optional for the purpose of this particular query. But since you plan to do all kinds of queries, you'll end up wanting an index on the expression:

CREATE INDEX ltg_data_mmdd_idx ON event(f_mmdd(time));

(Not needed for this query.)
integer is a bit faster for this purpose. And you need the (otherwise optional) function wrapper for this since to_char() is only defined STABLE, but we need IMMUTABLE for the index. The updated expression (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int is IMMUTABLE, but the function wrapper is still convenient.

Related:

  • How do you do date math that ignores the year?
  • Generating time series between two dates in PostgreSQL