I have an updates table in Postgres 9.4.5 like this:
goal_id | created_at | status
1 | 2016-01-01 | green
1 | 2016-01-02 | red
2 | 2016-01-02 | amber
And a goals table like this:
id | company_id
1 | 1
2 | 2
I want to create a chart for each company that shows the state of all of their goals, per week.
I imagine this would require generating a series of the past 8 weeks, finding the most recent update for each goal that came before that week, then counting the different statuses of the found updates.
What I have so far:
SELECT EXTRACT(year from generate_series) AS year,
       EXTRACT(week from generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
    SELECT DISTINCT ON (year, week)
           goals.company_id,
           updates.status,
           EXTRACT(week from updates.created_at) AS week,
           EXTRACT(year from updates.created_at) AS year,
           updates.created_at
    FROM updates
    JOIN goals ON goals.id = updates.goal_id
    ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1, 2, 3
But this has two problems. First, the join on u isn't working as I thought it would: it seems to be joining on every row returned from the inner query. Second, this only selects the most recent update from within each week, whereas it should fall back to the most recent update from before that week if the week has none.
This is some pretty complicated SQL and I'd love some input on how to pull it off.
Table structures and info
The goals table has ~1,000 rows ATM and is growing by ~100 a week:
Table "goals"
Column | Type | Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
id | integer | not null default nextval('goals_id_seq'::regclass)
company_id | integer | not null
name | text | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goals_pkey" PRIMARY KEY, btree (id)
"entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
"goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT
The updates table also has ~1,000 rows and is growing by ~100 a week:
Table "updates"
Column | Type | Modifiers
------------+-----------------------------+------------------------------------------------------------------
id | integer | not null default nextval('updates_id_seq'::regclass)
status | entity.goalstatus | not null
goal_id | integer | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goal_updates_pkey" PRIMARY KEY, btree (id)
"entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
"updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE
Schema | Name              | Internal name | Size | Elements          | Access privileges | Description
--------+-------------------+---------------+------+-------------------+-------------------+-------------
entity | entity.goalstatus | goalstatus    | 4    | green, amber, red |                   |
You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is to get the current state from updates for each. Like @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1. And simplify date handling with date_trunc().
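A minimal sketch of such a query, assuming the table and column names from the question (the alias w_start and the exact series bounds are my own):

SELECT w_start
     , g.company_id
     , COUNT(*) FILTER (WHERE u.status = 'green') AS green_count
     , COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , COUNT(*) FILTER (WHERE u.status = 'red')   AS red_count
FROM   generate_series(date_trunc('week', now() - interval '2 months')
                     , now(), interval '1 week') w_start
CROSS  JOIN goals g                -- one row per week and goal
LEFT   JOIN LATERAL (
   SELECT status                   -- latest update before the week starts
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON true
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;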
To make this fast you need a multicolumn index:
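For example (the index name is my own):

CREATE INDEX updates_goal_created_status_idx
ON updates (goal_id, created_at DESC, status);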
Descending order for created_at is best, but not strictly necessary: Postgres can scan indexes backwards almost exactly as fast (not applicable for inverted sort order of multiple columns, though). Index the columns in that order. The third column, status, is only appended to allow fast index-only scans on updates.

1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups on the second table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more rows in each table, performance will deteriorate with sequential scans.
w_start represents the start of each week; consequently, counts are as of the start of the week. You can still extract year and week (or whatever other details represent your week) if you insist. Best with ISOYEAR, like @Paul explained.
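A minimal example of that extraction, applied directly to the generated series (same assumed bounds as above):

SELECT EXTRACT(isoyear FROM w_start) AS year
     , EXTRACT(week    FROM w_start) AS week
FROM   generate_series(date_trunc('week', now() - interval '2 months')
                     , now(), interval '1 week') w_start;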
SQL Fiddle.
This seems like a good use for LATERAL joins:
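A sketch of what that could look like against the question's tables (the alias s and the series bounds are my own; this version uses DISTINCT ON inside the lateral subquery rather than LIMIT 1):

SELECT EXTRACT(isoyear FROM s) AS year
     , EXTRACT(week    FROM s) AS week
     , g.company_id
     , COUNT(*) FILTER (WHERE u.status = 'green') AS green_count
     , COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , COUNT(*) FILTER (WHERE u.status = 'red')   AS red_count
FROM   generate_series(now() - interval '2 months', now(), interval '1 week') s
CROSS  JOIN goals g
LEFT   JOIN LATERAL (
   -- most recent update for this goal strictly before the week
   SELECT DISTINCT ON (goal_id) status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < s
   ORDER  BY goal_id, created_at DESC
   ) u ON true
GROUP  BY 1, 2, 3;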
Btw, I am extracting ISOYEAR, not YEAR, to ensure I get sensible results around the beginning of January. For instance, EXTRACT(YEAR FROM timestamptz '2016-01-01 08:49:56.734556-08') is 2016, but EXTRACT(WEEK FROM timestamptz '2016-01-01 08:49:56.734556-08') is 53!
EDIT: You should test on your real data, but I feel like this ought to be faster:
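Judging from the remarks below ((array_agg(...))[1], no window functions), a sketch of that faster version might look roughly like this:

SELECT year, week, company_id
     , COUNT(*) FILTER (WHERE status = 'green') AS green_count
     , COUNT(*) FILTER (WHERE status = 'amber') AS amber_count
     , COUNT(*) FILTER (WHERE status = 'red')   AS red_count
FROM (
   -- latest status per week and goal: order the aggregate by recency
   -- and take the first array element
   SELECT EXTRACT(isoyear FROM s) AS year
        , EXTRACT(week    FROM s) AS week
        , g.company_id
        , g.id AS goal_id
        , (array_agg(u.status ORDER BY u.created_at DESC))[1] AS status
   FROM   generate_series(now() - interval '2 months', now(), interval '1 week') s
   CROSS  JOIN goals g
   LEFT   JOIN updates u ON u.goal_id = g.id AND u.created_at < s
   GROUP  BY 1, 2, 3, 4
   ) per_goal
GROUP  BY 1, 2, 3;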
Still no window functions, though. :-) Also, you can speed it up a bit more by replacing (array_agg(...))[1] with a real first aggregate function. You'll have to define that yourself, but there are implementations on the Postgres wiki that are easy to Google for.

I use PostgreSQL 9.3. I was interested in your question, so I examined your data structure and created the following tables:
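Something along these lines, sketched from the structures in the question (the company table layout and the plain-text status column are simplifications of mine):

-- simplified sketch of the schema from the question
CREATE TABLE company (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE goals (
    id         serial PRIMARY KEY,
    company_id integer NOT NULL REFERENCES company(id)
);

CREATE TABLE updates (
    id         serial PRIMARY KEY,
    status     text    NOT NULL,   -- stands in for the enum entity.goalstatus
    goal_id    integer NOT NULL REFERENCES goals(id),
    created_at timestamp NOT NULL DEFAULT now()
);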
I inserted the following records into company, goals, and updates:
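For example, reproducing the sample data from the question (the company names are made up):

INSERT INTO company (id, name) VALUES (1, 'Company A'), (2, 'Company B');
INSERT INTO goals   (id, company_id) VALUES (1, 1), (2, 2);
INSERT INTO updates (status, goal_id, created_at) VALUES
    ('green', 1, '2016-01-01'),
    ('red',   1, '2016-01-02'),
    ('amber', 2, '2016-01-02');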
After that I wrote the following query to find the most recent update per goal, and checked the results:
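A 9.3-compatible sketch of such a query (no FILTER clause, which only arrived in 9.4):

SELECT u.goal_id, g.company_id, u.status, u.created_at
FROM   updates u
JOIN   goals g ON g.id = u.goal_id
WHERE  u.created_at = (SELECT max(created_at)
                       FROM   updates
                       WHERE  goal_id = u.goal_id);
-- with the sample data above, this yields:
--   1 | 1 | red   | 2016-01-02
--   2 | 2 | amber | 2016-01-02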
At last, I merged this query with the week series to get counts per company and week:
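A sketch of that merge, again avoiding 9.4-only features (the series bounds are my own):

SELECT w.week
     , g.company_id
     , SUM(CASE WHEN u.status = 'green' THEN 1 ELSE 0 END) AS green_count
     , SUM(CASE WHEN u.status = 'amber' THEN 1 ELSE 0 END) AS amber_count
     , SUM(CASE WHEN u.status = 'red'   THEN 1 ELSE 0 END) AS red_count
FROM   generate_series(date_trunc('week', now() - interval '2 months')
                     , now(), interval '1 week') w(week)
CROSS  JOIN goals g
LEFT   JOIN updates u
       ON  u.goal_id = g.id
       AND u.created_at = (SELECT max(created_at)
                           FROM   updates
                           WHERE  goal_id = g.id
                           AND    created_at < w.week)
GROUP  BY w.week, g.company_id
ORDER  BY w.week, g.company_id;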
Have a good day.