可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have an updates
table in Postgres is 9.4.5 like this:
goal_id | created_at | status
1 | 2016-01-01 | green
1 | 2016-01-02 | red
2 | 2016-01-02 | amber
And a goals
table like this:
id | company_id
1 | 1
2 | 2
I want to create a chart for each company that shows the state of all of their goals, per week.
I image this would require to generate a series of the past 8 weeks, finding the most recent update for each goal that came before that week, then counting the different statuses of the found updates.
What I have so far:
SELECT EXTRACT(year from generate_series) AS year,
EXTRACT(week from generate_series) AS week,
u.company_id,
COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
SELECT DISTINCT ON(year, week)
goals.company_id,
updates.status,
EXTRACT(week from updates.created_at) week,
EXTRACT(year from updates.created_at) AS year,
updates.created_at
FROM updates
JOIN goals ON goals.id = updates.goal_id
ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3
But this has two problems. It seems that the join on u
isn't working as I thought it would. It seems to be joining on every row (?) returned from the inner query as well as this only selects the most recent update that happened from that week. It should grab the most recent update from before that week if it needs to.
This is some pretty complicated SQL and I love some input on how to pull it off.
Table structures and info
The goals table has around ~1000 goals ATM and is growing about ~100 a week:
Table "goals"
Column | Type | Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
id | integer | not null default nextval('goals_id_seq'::regclass)
company_id | integer | not null
name | text | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goals_pkey" PRIMARY KEY, btree (id)
"entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
"goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT
The updates
table has around ~1000 and is growing around ~100 a week:
Table "updates"
Column | Type | Modifiers
------------+-----------------------------+------------------------------------------------------------------
id | integer | not null default nextval('updates_id_seq'::regclass)
status | entity.goalstatus | not null
goal_id | integer | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goal_updates_pkey" PRIMARY KEY, btree (id)
"entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
"updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE
Schema | Name | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
entity | entity.goalstatus | goalstatus | 4 | green +| |
| | | | amber +| |
| | | | red | |
回答1:
You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN
between generate_series()
and goals
. The (possibly) expensive part is to get the current state
from updates
for each. Like @Paul already suggested, a LATERAL
join seems like the best tool. Do it only for updates
, though, and use a faster technique with LIMIT 1
.
And simplify date handling with date_trunc()
.
SELECT w_start
, g.company_id
, count(*) FILTER (WHERE u.status = 'green') AS green_count
, count(*) FILTER (WHERE u.status = 'amber') AS amber_count
, count(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(date_trunc('week', NOW() - interval '2 months')
, date_trunc('week', NOW())
, interval '1 week') w_start
CROSS JOIN goals g
LEFT JOIN LATERAL (
SELECT status
FROM updates
WHERE goal_id = g.id
AND created_at < w_start
ORDER BY created_at DESC
LIMIT 1
) u ON true
GROUP BY w_start, g.company_id
ORDER BY w_start, g.company_id;
To make this fast you need a multicolumn index:
CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);
Descending order for created_at
is best, but not strictly necessary. Postgres can scan indexes backwards almost exactly as fast. (Not applicable for inverted sort order of multiple columns, though.)
Index columns in that order. Why?
- Multicolumn index and performance
And the third column status
is only appended to allow fast index-only scans on updates
. Related case:
- Slow index scans in large table
1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups for the 2nd table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more in each table, performance will deteriorate with sequential scans.
w_start
represents the start of each week. Consequently, counts are for the start of the week. You can still extract year and week (or any other details represent your week), if you insist:
EXTRACT(isoyear from w_start) AS year
, EXTRACT(week from w_start) AS week
Best with ISOYEAR
, like @Paul explained.
SQL Fiddle.
Related:
- What is the difference between LATERAL and a subquery in PostgreSQL?
- Optimize GROUP BY query to retrieve latest record per user
- Select first row in each GROUP BY group?
- PostgreSQL: running count of rows for a query 'by minute'
回答2:
This seems like a good use for LATERAL
joins:
SELECT EXTRACT(ISOYEAR FROM s) AS year,
EXTRACT(WEEK FROM s) AS week,
u.company_id,
COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count,
COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count,
COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)
LEFT OUTER JOIN LATERAL (
SELECT DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status
FROM updates u2
INNER JOIN goals g
ON g.id = u2.goal_id
WHERE u2.created_at <= s.w
ORDER BY g.company_id, u2.goal_id, u2.created_at DESC
) u
ON true
WHERE u.company_id IS NOT NULL
GROUP BY year, week, u.company_id
ORDER BY u.company_id, year, week
;
Btw I am extracting ISOYEAR
not YEAR
to ensure I get sensible results around the beginning of January. For instance EXTRACT(YEAR FROM '2016-01-01 08:49:56.734556-08')
is 2016
but EXTRACT(WEEK FROM '2016-01-01 08:49:56.734556-08')
is 53
!
EDIT: You should test on your real data, but I feel like this ought to be faster:
SELECT year,
week,
company_id,
COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count,
COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count,
COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_count
FROM (
SELECT EXTRACT(ISOYEAR FROM s) AS year,
EXTRACT(WEEK FROM s) AS week,
u.company_id,
u.goal_id,
(array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status
FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t)
LEFT OUTER JOIN (
SELECT g.company_id, u2.goal_id, u2.created_at, u2.status
FROM updates u2
INNER JOIN goals g
ON g.id = u2.goal_id
) u
ON s.t >= u.created_at
WHERE u.company_id IS NOT NULL
GROUP BY year, week, u.company_id, u.goal_id
) x
GROUP BY year, week, company_id
ORDER BY company_id, year, week
;
Still no window functions though. :-) Also you can speed it up a bit more by replacing (array_agg(...))[1]
with a real first
function. You'll have to define that yourself, but there are implementations on the Postgres wiki that are easy to Google for.
回答3:
I use PostgreSQL 9.3. I'm interested in your question. I examined your data structure. Than I create the following tables.
I insert the following records;
Company
Goals
Updates
After that I wrote the following query, for correction
SELECT c.id company_id, c.name company_name, u.status goal_status,
EXTRACT(week from u.created_at) goal_status_week,
EXTRACT(year from u.created_at) AS goal_status_year
FROM company c
INNER JOIN goals g ON g.company_id = c.id
INNER JOIN updates u ON u.goal_id = g.id
ORDER BY goal_status_year DESC, goal_status_week DESC;
I get the following results;
At last I merge this query with week series
SELECT
gs.company_id,
gs.company_name,
gs.goal_status,
EXTRACT(year from w) AS year,
EXTRACT(week from w) AS week,
COUNT(gs.*) cnt
FROM generate_series(NOW() - INTERVAL '3 MONTHS', NOW(), '1 week') w
LEFT JOIN(
SELECT c.id company_id, c.name company_name, u.status goal_status,
EXTRACT(week from u.created_at) goal_status_week,
EXTRACT(year from u.created_at) AS goal_status_year
FROM company c
INNER JOIN goals g ON g.company_id = c.id
INNER JOIN updates u ON u.goal_id = g.id ) gs
ON gs.goal_status_week = EXTRACT(week from w) AND gs.goal_status_year = EXTRACT(year from w)
GROUP BY company_id, company_name, goal_status, year, week
ORDER BY year DESC, week DESC;
I get this result
Have a good day.