Aggregating the most recent joined records per wee

2019-06-16 09:25发布

站内文章 / 前沿技术

25 0

来，给爷笑一个

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have an updates table in Postgres is 9.4.5 like this:

goal_id    | created_at | status
1          | 2016-01-01 | green
1          | 2016-01-02 | red
2          | 2016-01-02 | amber

And a goals table like this:

id | company_id
1  | 1
2  | 2

I want to create a chart for each company that shows the state of all of their goals, per week.

I image this would require to generate a series of the past 8 weeks, finding the most recent update for each goal that came before that week, then counting the different statuses of the found updates.

What I have so far:

SELECT EXTRACT(year from generate_series) AS year, 
       EXTRACT(week from generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
  SELECT DISTINCT ON(year, week)
         goals.company_id,
         updates.status, 
         EXTRACT(week from updates.created_at) week,
         EXTRACT(year from updates.created_at) AS year,
         updates.created_at 
  FROM updates
  JOIN goals ON goals.id = updates.goal_id
  ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3

But this has two problems. It seems that the join on u isn't working as I thought it would. It seems to be joining on every row (?) returned from the inner query as well as this only selects the most recent update that happened from that week. It should grab the most recent update from before that week if it needs to.

This is some pretty complicated SQL and I love some input on how to pull it off.

Table structures and info

The goals table has around ~1000 goals ATM and is growing about ~100 a week:

                                           Table "goals"
     Column      |            Type             |                         Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
 id              | integer                     | not null default nextval('goals_id_seq'::regclass)
 company_id      | integer                     | not null
 name            | text                        | not null
 created_at      | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at      | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goals_pkey" PRIMARY KEY, btree (id)
    "entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
    "goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT

The updates table has around ~1000 and is growing around ~100 a week:

                                         Table "updates"
   Column   |            Type             |                            Modifiers
------------+-----------------------------+------------------------------------------------------------------
 id         | integer                     | not null default nextval('updates_id_seq'::regclass)
 status     | entity.goalstatus           | not null
 goal_id    | integer                     | not null
 created_at | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goal_updates_pkey" PRIMARY KEY, btree (id)
    "entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
    "updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE

 Schema |       Name        | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
 entity | entity.goalstatus | goalstatus    | 4    | green   +|                   |
        |                   |               |      | amber   +|                   |
        |                   |               |      | red      |                   |

回答1:

You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is to get the current state from updates for each. Like @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1.

And simplify date handling with date_trunc().

SELECT w_start
     , g.company_id
     , count(*) FILTER (WHERE u.status = 'green') AS green_count
     , count(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , count(*) FILTER (WHERE u.status = 'red')   AS red_count
FROM   generate_series(date_trunc('week', NOW() - interval '2 months')
                     , date_trunc('week', NOW())
                     , interval '1 week') w_start
CROSS  JOIN goals g
LEFT   JOIN LATERAL (
   SELECT status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON true
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;

To make this fast you need a multicolumn index:

CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);

Descending order for created_at is best, but not strictly necessary. Postgres can scan indexes backwards almost exactly as fast. (Not applicable for inverted sort order of multiple columns, though.)

Index columns in that order. Why?

Multicolumn index and performance

And the third column status is only appended to allow fast index-only scans on updates. Related case:

Slow index scans in large table

1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups for the 2nd table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more in each table, performance will deteriorate with sequential scans.

w_start represents the start of each week. Consequently, counts are for the start of the week. You can still extract year and week (or any other details represent your week), if you insist:

   EXTRACT(isoyear from w_start) AS year
 , EXTRACT(week    from w_start) AS week

Best with ISOYEAR, like @Paul explained.

SQL Fiddle.

What is the difference between LATERAL and a subquery in PostgreSQL?
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
PostgreSQL: running count of rows for a query 'by minute'

回答2:

This seems like a good use for LATERAL joins:

SELECT  EXTRACT(ISOYEAR FROM s) AS year,
        EXTRACT(WEEK FROM s) AS week,
        u.company_id,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_count
FROM    generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)
LEFT OUTER JOIN LATERAL (
  SELECT  DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status
  FROM    updates u2
  INNER JOIN goals g
  ON      g.id = u2.goal_id
  WHERE   u2.created_at <= s.w
  ORDER BY g.company_id, u2.goal_id, u2.created_at DESC
) u 
ON true
WHERE   u.company_id IS NOT NULL
GROUP BY year, week, u.company_id
ORDER BY u.company_id, year, week
;

Btw I am extracting ISOYEAR not YEAR to ensure I get sensible results around the beginning of January. For instance EXTRACT(YEAR FROM '2016-01-01 08:49:56.734556-08') is 2016 but EXTRACT(WEEK FROM '2016-01-01 08:49:56.734556-08') is 53!

EDIT: You should test on your real data, but I feel like this ought to be faster:

SELECT  year,
        week,
        company_id,
        COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_count
FROM    (
  SELECT  EXTRACT(ISOYEAR FROM s) AS year,
          EXTRACT(WEEK FROM s) AS week,
          u.company_id,
          u.goal_id,
          (array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status
  FROM    generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t)
  LEFT OUTER JOIN ( 
    SELECT  g.company_id, u2.goal_id, u2.created_at, u2.status
    FROM    updates u2
    INNER JOIN goals g 
    ON      g.id = u2.goal_id
  ) u 
  ON      s.t >= u.created_at
  WHERE   u.company_id IS NOT NULL
  GROUP BY year, week, u.company_id, u.goal_id
) x
GROUP BY year, week, company_id
ORDER BY company_id, year, week
;

Still no window functions though. :-) Also you can speed it up a bit more by replacing (array_agg(...))[1] with a real first function. You'll have to define that yourself, but there are implementations on the Postgres wiki that are easy to Google for.

回答3:

I use PostgreSQL 9.3. I'm interested in your question. I examined your data structure. Than I create the following tables.

I insert the following records;

Company

Goals

Updates

After that I wrote the following query, for correction

SELECT c.id company_id, c.name company_name, u.status goal_status, 
         EXTRACT(week from u.created_at) goal_status_week,
         EXTRACT(year from u.created_at) AS goal_status_year 
FROM company c
INNER JOIN goals g ON g.company_id = c.id 
INNER JOIN updates u ON u.goal_id = g.id
ORDER BY goal_status_year DESC, goal_status_week DESC;

I get the following results;

At last I merge this query with week series

SELECT
             gs.company_id,
             gs.company_name,
             gs.goal_status,
             EXTRACT(year from w) AS year, 
       EXTRACT(week from w) AS week,
             COUNT(gs.*) cnt
FROM generate_series(NOW() - INTERVAL '3 MONTHS', NOW(), '1 week') w
LEFT JOIN(
SELECT c.id company_id, c.name company_name, u.status goal_status, 
             EXTRACT(week from u.created_at) goal_status_week,
       EXTRACT(year from u.created_at) AS goal_status_year 
FROM company c
INNER JOIN goals g ON g.company_id = c.id 
INNER JOIN updates u ON u.goal_id = g.id ) gs 
ON gs.goal_status_week = EXTRACT(week from w) AND gs.goal_status_year = EXTRACT(year from w)
GROUP BY company_id, company_name, goal_status, year, week
ORDER BY  year DESC, week DESC;

I get this result