Optimize GROUP BY query to retrieve latest record

Posted 2018-12-31 02:51

I have the following table (simplified form) in Postgres 9.2

CREATE TABLE user_msg_log (
    aggr_date DATE,
    user_id INTEGER,
    running_total INTEGER
);

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. running_total is always increasing for each user.

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

SELECT user_id, max(aggr_date), max(running_total) 
FROM user_msg_log 
WHERE aggr_date <= :mydate 
GROUP BY user_id

which is extremely slow. I have also tried:

SELECT DISTINCT ON (user_id) user_id, aggr_date, running_total
FROM user_msg_log
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC;

which has the same plan and is equally slow.

So far I have a single index on user_msg_log(aggr_date), but it doesn't help much. Is there any other index I should use to speed this up, or any other way to achieve what I want?

3 Answers
刘海飞了
Answer 2 · 2018-12-31 03:38

This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the users table to exploit the locality of the index on user_msg_log.

SELECT u.user_id, l.aggr_date, l.running_total
  FROM (SELECT user_id FROM users ORDER BY user_id) u,
       LATERAL (SELECT aggr_date, running_total
                  FROM user_msg_log
                 WHERE user_id = u.user_id -- lateral reference
                   AND aggr_date <= :mydate
              ORDER BY aggr_date DESC NULLS LAST
                 LIMIT 1) l;

The rationale: index lookups are expensive when the user_id values arrive in random order. By sorting on user_id first, the subsequent lateral joins walk the index on user_msg_log in order, much like a sequential scan of the index. Even though both query plans look alike, the running times can differ considerably, especially for large tables.

The cost of the sort is minimal, especially if there is an index on user_id.

与君花间醉酒
Answer 3 · 2018-12-31 03:39

For best read performance you need a multicolumn index:

CREATE INDEX user_msg_log_combo_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST)

To make index only scans possible, add the otherwise not needed column running_total:

CREATE INDEX user_msg_log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total)

Why DESC NULLS LAST? The queries below sort by aggr_date DESC NULLS LAST; an index declared with the same sort order can serve that ORDER BY directly, without an extra sort step.

For few rows per user_id or small tables, a simple DISTINCT ON is among the fastest and simplest solutions.
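For reference, here is a corrected DISTINCT ON version of the query from the question (note that the target list follows DISTINCT ON (user_id) without a comma):

SELECT DISTINCT ON (user_id)
       user_id, aggr_date, running_total
FROM   user_msg_log
WHERE  aggr_date <= :mydate
ORDER  BY user_id, aggr_date DESC NULLS LAST;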

For many rows per user_id, a loose index scan would be (much) more efficient. That's not implemented in Postgres (at least up to Postgres 10), but there are ways to emulate it:

1. No separate table with unique users

The following solutions go beyond what's covered in the Postgres Wiki.
With a separate users table, solutions in 2. below are typically simpler and faster.

1a. Recursive CTE with LATERAL join

Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT user_id, aggr_date, running_total
   FROM   user_msg_log
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT u.user_id, u.aggr_date, u.running_total
   FROM   cte c
   ,      LATERAL (
      SELECT user_id, aggr_date, running_total
      FROM   user_msg_log
      WHERE  user_id > c.user_id   -- lateral reference
      AND    aggr_date <= :mydate  -- repeat condition
      ORDER  BY user_id, aggr_date DESC NULLS LAST
      LIMIT  1
      ) u
   )
SELECT user_id, aggr_date, running_total
FROM   cte
ORDER  BY user_id;

This is preferable in current versions of Postgres and it's simple to retrieve arbitrary columns. More explanation in chapter 2a. below.

1b. Recursive CTE with correlated subqueries

Convenient to retrieve either a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

WITH RECURSIVE cte AS (
   (
   SELECT u  -- whole row
   FROM   user_msg_log u
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT u1  -- again, whole row
           FROM   user_msg_log u1
           WHERE  user_id > (c.u).user_id  -- parentheses to access row type
           AND    aggr_date <= :mydate     -- repeat predicate
           ORDER  BY user_id, aggr_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.u).user_id IS NOT NULL        -- any NOT NULL column of the row
   )
SELECT (u).*                               -- finally decompose row
FROM   cte
WHERE  (u).user_id IS NOT NULL             -- any column defined NOT NULL
ORDER  BY (u).user_id;

It could be misleading to test the row value with c.u IS NOT NULL. This only returns true if every single column of the tested row is NOT NULL and would fail if a single NULL value is contained. (I had this bug in my answer for some time.) Instead, to assert a row was found in the previous iteration, test a single column of the row that is defined NOT NULL (like the primary key).

More explanation for this query in chapter 2b. below.

2. With separate users table

Table layout hardly matters as long as we have exactly one row per relevant user_id. Example:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted by user_id.

Or it's small enough (low cardinality) that it hardly matters.
Else, sorting the rows in the query can help to further optimize performance. See Gang Liang's addition.
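One way to sort the table physically is CLUSTER on the primary key index (users_pkey is the name Postgres generates by default for the table definition above). CLUSTER rewrites the table once; the order is not maintained for later writes:

CLUSTER users USING users_pkey;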

2a. LATERAL join

SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
CROSS  JOIN LATERAL (
   SELECT aggr_date, running_total
   FROM   user_msg_log
   WHERE  user_id = u.user_id  -- lateral reference
   AND    aggr_date <= :mydate
   ORDER  BY aggr_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL lets you reference preceding FROM items on the same query level. You get one index (-only) look-up per user.

Consider the possible improvement by sorting the users table suggested by Gang Liang in another answer. If the physical sort order of the users table happens to match the index on user_msg_log, you don't need this.

You don't get results for users missing in the users table, even if you have entries in user_msg_log. Typically, you would have a foreign key constraint enforcing referential integrity to rule that out.

You also don't get a row for any user that has no matching entry in user_msg_log. That conforms to your original question. If you need to include those rows in the result, use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL.
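A sketch of that variant; users without a qualifying entry in user_msg_log come back with NULL for both columns:

SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
LEFT   JOIN LATERAL (
   SELECT aggr_date, running_total
   FROM   user_msg_log
   WHERE  user_id = u.user_id  -- lateral reference
   AND    aggr_date <= :mydate
   ORDER  BY aggr_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;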

This form is also best to retrieve more than one row (but not all rows) per user. Just use LIMIT n instead of LIMIT 1.

Effectively, all of these would do the same:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The latter binds less tightly, though: an explicit JOIN binds before the comma.

2b. Correlated subquery

Good choice to retrieve a single column from a single row.
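For instance, to fetch only the latest running_total per user (a sketch along the lines of the multi-column version below):

SELECT u.user_id
     , (SELECT running_total
        FROM   user_msg_log
        WHERE  user_id = u.user_id
        AND    aggr_date <= :mydate
        ORDER  BY aggr_date DESC NULLS LAST
        LIMIT  1) AS latest_running_total
FROM   users u;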

The same is possible for multiple columns, but you need more smarts:

CREATE TEMP TABLE combo (aggr_date date, running_total int);

SELECT user_id, (my_combo).*  -- note the parentheses
FROM (
   SELECT u.user_id
        , (SELECT (aggr_date, running_total)::combo
           FROM   user_msg_log
           WHERE  user_id = u.user_id
           AND    aggr_date <= :mydate
           ORDER  BY aggr_date DESC NULLS LAST
           LIMIT  1) AS my_combo
   FROM   users u
   ) sub;
  • Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in user_msg_log. You get NULL for my_combo, which you can easily filter with a WHERE clause in the outer query if need be.
    Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all values returned happen to be NULL - same result. You would have to include a NOT NULL column in the subquery to be sure.

  • A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.

  • Use a registered type like the row type of an existing table, or create a type. Register a composite type explicitly (and permanently) with CREATE TYPE, or create a temporary table (dropped automatically at end of session) to provide a row type temporarily. Cast to that type: (aggr_date, running_total)::combo

  • Finally, we do not want to decompose combo on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (up to Postgres 9.6 - improvements are planned for Postgres 10). Instead, make it a subquery and decompose in the outer query.
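As an alternative to the temporary table above, the composite type can be registered permanently; the cast (aggr_date, running_total)::combo then works the same way:

CREATE TYPE combo AS (aggr_date date, running_total int);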


Demonstrating all 4 queries with 100k log entries and 1k users:
SQL Fiddle - pg 9.6
db<>fiddle here - pg 10

冷夜・残月
Answer 4 · 2018-12-31 03:40

Perhaps a different index on the table would help. Try this one: user_msg_log(user_id, aggr_date). I am not positive that Postgres will make optimal use of it with DISTINCT ON.

So, I would stick with that index and try this version:

select *
from user_msg_log uml
where uml.aggr_date <= :mydate and
      not exists (select 1
                  from user_msg_log uml2
                  where uml2.user_id = uml.user_id and
                        uml2.aggr_date <= :mydate and
                        uml2.aggr_date > uml.aggr_date
                 );

This should replace the sorting/grouping with index look ups. It might be faster.
