I have the following table (simplified form) in Postgres 9.2:
```sql
CREATE TABLE user_msg_log (
  aggr_date     DATE,
  user_id       INTEGER,
  running_total INTEGER
);
```
It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. running_total is always increasing for each user.
I want to efficiently retrieve the latest record for each user before a specific date. My query is:
```sql
SELECT user_id, max(aggr_date), max(running_total)
FROM user_msg_log
WHERE aggr_date <= :mydate
GROUP BY user_id
```
which is extremely slow. I have also tried:
```sql
SELECT DISTINCT ON (user_id) user_id, aggr_date, running_total
FROM user_msg_log
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC;
```
which has the same plan and is equally slow.
So far I have a single index on `user_msg_log(aggr_date)`, but it doesn't help much. Is there any other index I should use to speed this up, or any other way to achieve what I want?
This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the `users` table to exploit the locality of the index on `user_msg_log`.

The rationale is that index lookup is expensive if `user_id` values are random. By sorting on `user_id` first, the subsequent lateral join becomes a simple scan over the index of `user_msg_log`. Even though both query plans look alike, the running times can differ greatly, especially for large tables.

The cost of the sorting is minimal, especially if there is an index on the `user_id` field.

For best read performance you need a multicolumn index:
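A concrete sketch of such an index - the names are my own, and the `DESC NULLS LAST` modifier and the covering variant are explained just below:

```sql
-- Multicolumn index: user_id first, latest aggr_date first within each user.
CREATE INDEX log_combo_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST);

-- Covering variant: appending running_total enables index-only scans.
-- (In practice you would create one of the two, not both.)
CREATE INDEX log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total);
```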
To make index-only scans possible, add the otherwise not needed column `running_total` to that index.

Why `DESC NULLS LAST`? With plain `DESC`, `NULL` values sort first in descending order, so a row with a `NULL` in `aggr_date` could be picked as the "latest" one; `NULLS LAST` avoids that and lets the index match the `ORDER BY` of the queries below.

For few rows per `user_id` or small tables, a simple `DISTINCT ON` is among the fastest and simplest solutions.

For many rows per `user_id`, a loose index scan would be (much) more efficient. That's not implemented in Postgres (at least up to Postgres 10), but there are ways to emulate it:

1. No separate table with unique users
The following solutions go beyond what's covered in the Postgres Wiki.
With a separate `users` table, solutions in 2. below are typically simpler and faster.

1a. Recursive CTE with `LATERAL` join

Common Table Expressions require Postgres 8.4+. `LATERAL` requires Postgres 9.3+.

This is preferable in current versions of Postgres and it's simple to retrieve arbitrary columns. More explanation in chapter 2a. below.
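A sketch of what such a query could look like for the table in the question - the date literal stands in for the `:mydate` parameter:

```sql
WITH RECURSIVE cte AS (
   (  -- parentheses required around the anchor to allow ORDER BY / LIMIT
   SELECT user_id, aggr_date, running_total
   FROM   user_msg_log
   WHERE  aggr_date <= '2020-01-01'              -- :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST  -- lowest user_id, latest row
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (                         -- latest row of the *next* user
      SELECT l.user_id, l.aggr_date, l.running_total
      FROM   user_msg_log l
      WHERE  l.user_id > c.user_id
      AND    l.aggr_date <= '2020-01-01'         -- :mydate
      ORDER  BY l.user_id, l.aggr_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
SELECT *
FROM   cte
ORDER  BY user_id;
```

The recursion stops by itself once the `LATERAL` subquery finds no user with a higher `user_id`.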
1b. Recursive CTE with correlated subqueries
Convenient to retrieve either a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.
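For instance, along these lines - a sketch only; the simplified table has no declared primary key, so the comments mark where a `NOT NULL` column is assumed:

```sql
WITH RECURSIVE cte AS (
   (
   SELECT u                                      -- whole row of the table
   FROM   user_msg_log u
   WHERE  aggr_date <= '2020-01-01'              -- :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT u1                             -- whole row for the next user
           FROM   user_msg_log u1
           WHERE  u1.user_id > (c.u).user_id
           AND    u1.aggr_date <= '2020-01-01'   -- :mydate
           ORDER  BY u1.user_id, u1.aggr_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.u).user_id IS NOT NULL              -- test a NOT NULL column, not the row value
   )
SELECT (u).*                                     -- decompose the row type
FROM   cte
WHERE  (u).user_id IS NOT NULL
ORDER  BY (u).user_id;
```

Here `user_id` plays the role of the `NOT NULL` column discussed in the caveat below.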
It could be misleading to test the row value with `c.u IS NOT NULL`. This only returns `true` if every single column of the tested row is `NOT NULL` and would fail if a single `NULL` value is contained. (I had this bug in my answer for some time.) Instead, to assert a row was found in the previous iteration, test a single column of the row that is defined `NOT NULL` (like the primary key).

More explanation for this query in chapter 2b. below.
2. With separate `users` table

Table layout hardly matters as long as we have exactly one row per relevant `user_id`. Ideally, the table is physically sorted in sync with the index on `user_msg_log`. Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. Example:
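A minimal sketch of such a table:

```sql
CREATE TABLE users (
   user_id int PRIMARY KEY
   -- other per-user attributes, irrelevant for this query
);
```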
2a. `LATERAL` join

`JOIN LATERAL` allows to reference preceding `FROM` items on the same query level. You get one index(-only) look-up per user.
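A sketch of the lateral-join query, with a date literal in place of `:mydate`:

```sql
SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
CROSS  JOIN LATERAL (
   SELECT m.aggr_date, m.running_total
   FROM   user_msg_log m
   WHERE  m.user_id   = u.user_id
   AND    m.aggr_date <= '2020-01-01'            -- :mydate
   ORDER  BY m.aggr_date DESC NULLS LAST         -- matches the suggested index
   LIMIT  1
   ) l;
```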
Consider the possible improvement of sorting the `users` table, as suggested by Gang Liang in another answer. If the physical sort order of the `users` table happens to match the index on `user_msg_log`, you don't need this.

You don't get results for users missing in the
`users` table, even if you have entries in `user_msg_log`. Typically, you would have a foreign key constraint enforcing referential integrity to rule that out.

You also don't get a row for any user that has no matching entry in `user_msg_log`. That conforms to your original question. If you need to include those rows in the result, use `LEFT JOIN LATERAL ... ON true` instead of `CROSS JOIN LATERAL`.
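The inclusive variant could look like this - same query, only the join changes:

```sql
SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
LEFT   JOIN LATERAL (
   SELECT m.aggr_date, m.running_total
   FROM   user_msg_log m
   WHERE  m.user_id   = u.user_id
   AND    m.aggr_date <= '2020-01-01'            -- :mydate
   ORDER  BY m.aggr_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;                                  -- users without log rows get NULLs
```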
This form is also best to retrieve more than one row (but not all) per user. Just use `LIMIT n` instead of `LIMIT 1`.

Effectively, all of these spellings do the same: `CROSS JOIN LATERAL ...`, `JOIN LATERAL ... ON true`, and `, LATERAL ...`. The latter has a lower priority, though: explicit `JOIN` binds before comma.

2b. Correlated subquery
Good choice to retrieve a single column from a single row. Code example:
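A sketch retrieving just `running_total` per user:

```sql
SELECT u.user_id
     , (SELECT l.running_total
        FROM   user_msg_log l
        WHERE  l.user_id   = u.user_id
        AND    l.aggr_date <= '2020-01-01'       -- :mydate
        ORDER  BY l.aggr_date DESC NULLS LAST
        LIMIT  1) AS running_total
FROM   users u;
```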
The same is possible for multiple columns, but you need more smarts:
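One possible shape, anticipating the composite-type discussion below - `combo` and `my_combo` are names used for illustration:

```sql
-- Provide a registered row type; a temp table is one way (see below):
CREATE TEMP TABLE combo (aggr_date date, running_total int);

SELECT user_id, (my_combo).*                     -- decompose in the outer query
FROM  (
   SELECT u.user_id
        , (SELECT (l.aggr_date, l.running_total)::combo
           FROM   user_msg_log l
           WHERE  l.user_id   = u.user_id
           AND    l.aggr_date <= '2020-01-01'    -- :mydate
           ORDER  BY l.aggr_date DESC NULLS LAST
           LIMIT  1) AS my_combo
   FROM   users u
   ) sub;
```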
Like `LEFT JOIN LATERAL` above, this variant includes all users, even without entries in `user_msg_log`. You get `NULL` for `my_combo`, which you can easily filter with a `WHERE` clause in the outer query if need be.

Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all values returned happen to be `NULL` - same result. You would have to include a `NOT NULL` column in the subquery to be sure.

A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
Use a registered type like the row type of an existing table, or create a type. Register a composite type explicitly (and permanently) with `CREATE TYPE`, or create a temporary table (dropped automatically at end of session) to provide a row type temporarily. Cast to that type: `(aggr_date, running_total)::combo`
Finally, we do not want to decompose `combo` on the same query level. Due to a weakness in the query planner, this would evaluate the subquery once for each column (up to Postgres 9.6; improvements are planned for Postgres 10). Instead, make it a subquery and decompose in the outer query.
Demonstrating all 4 queries with 100k log entries and 1k users:
SQL Fiddle - pg 9.6
db<>fiddle here - pg 10
Perhaps a different index on the table would help. Try this one: `user_msg_log(user_id, aggr_date)`. I am not positive that Postgres will make optimal use of it with `distinct on`.

So, I would stick with that index and try this version:
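Something along these lines - my reconstruction of the version meant here: a grouped subquery finds each user's latest date, then joins back for the running total:

```sql
SELECT l.user_id, l.aggr_date, l.running_total
FROM  (
   SELECT user_id, max(aggr_date) AS max_date
   FROM   user_msg_log
   WHERE  aggr_date <= '2020-01-01'              -- :mydate
   GROUP  BY user_id
   ) m
JOIN   user_msg_log l
  ON   l.user_id   = m.user_id
 AND   l.aggr_date = m.max_date;
```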
This should replace the sorting/grouping with index look ups. It might be faster.