Postgres, table1 left join table2 with only 1 row

2019-05-01 06:35发布

问题:

Ok, so the title is a bit convoluted. This is basically a greatest-n-per-group type problem, but I can't for the life of me figure it out.

I have a table, user_stats:

------------------+---------+---------------------------------------------------------
 id               | bigint  | not null default nextval('user_stats_id_seq'::regclass)
 user_id          | bigint  | not null
 datestamp        | integer | not null
 post_count       | integer | 
 friends_count    | integer | 
 favourites_count | integer |  
Indexes:
    "user_stats_pk" PRIMARY KEY, btree (id)
    "user_stats_datestamp_index" btree (datestamp)
    "user_stats_user_id_index" btree (user_id)
Foreign-key constraints:
    "user_user_stats_fk" FOREIGN KEY (user_id) REFERENCES user_info(id)

I want to get the stats for each id by latest datestamp. This is a biggish table, somewhere in the neighborhood of 41m rows, so I've created a temp table of user_id, last_date using:

CREATE TEMP TABLE id_max_date AS
    (SELECT user_id, MAX(datestamp) AS date FROM user_stats GROUP BY user_id);

The problem is that datestamp isn't unique since there can be more than 1 stat update in a day (should have been a real timestamp but the guy who designed this was kind of an idiot and theres too much data to go back at the moment). So some IDs have multiple rows when I do the JOIN:

SELECT user_stats.user_id, user_stats.datestamp, user_stats.post_count,
       user_stats.friends_count, user_stats.favorites_count
  FROM id_max_date JOIN user_stats
    ON id_max_date.user_id=user_stats.user_id AND date=datestamp;

If I was doing this as subselects I guess I could LIMIT 1, but I've always heard those are horribly inefficient. Thoughts?

回答1:

DISTINCT ON is your friend.

select distinct on (user_id) * from user_stats order by datestamp desc;


回答2:

Basically you need to decide how to resolve ties, and you need some other column besides datestamp which is guaranteed to be unique (at least over a given user) so it can be used as the tiebreaker. If nothing else, you can use the id primary key column.

Another solution if you're using PostgreSQL 8.4 is windowing functions:

WITH numbered_user_stats AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY datestamp DESC) AS RowNum
    FROM user_stats) AS numbered_user_stats
) SELECT u.user_id, u.datestamp, u.post_count, u.friends_count, u.favorites_count
FROM numbered_user_stats AS u
WHERE u.RowNum = 1;


回答3:

Using the existing infrastructure, you can use:

SELECT u.user_id, u.datestamp,
       MAX(u.post_count)      AS post_count,
       MAX(u.friends_count)   AS friends_count,
       MAX(u.favorites_count) AS favorites_count
  FROM id_max_date AS m JOIN user_stats AS u
    ON m.user_id = u.user_id AND m.date = u.datestamp
 GROUP BY u.user_id, u.datestamp;

This gives you a single value for each of the 'not necessarily unique' columns. However, it does not absolutely guarantee that the three maxima all appeared in the same row (though there is at least a moderate chance that they will - and that they will all come from the last of entries created on the given day).

For this query, the index on date stamp alone is no help; an index on user ID and date stamp could speed this query up considerably - or, perhaps more accurately, it could speed up the query that generates the id_max_date table.

Clearly, you can also write the id_max_date expression as a sub-query in the FROM clause:

SELECT u.user_id, u.datestamp,
       MAX(u.post_count)      AS post_count,
       MAX(u.friends_count)   AS friends_count,
       MAX(u.favorites_count) AS favorites_count
  FROM (SELECT u2.user_id, MAX(u2.datestamp) AS date
          FROM user_stats AS u2
         GROUP BY u2.user_id) AS m
  JOIN user_stats AS u ON m.user_id = u.user_id AND m.date = u.datestamp
 GROUP BY u.user_id, u.datestamp;