postgresql (redshift) maximum value for a specific

2020-06-24 11:10发布

问题:

I'm working on redshift - I have a table like

userid  oid version number_of_objects
1       ab  1       10
1       ab  2       20
1       ab  3       17
1       ab  4       16
1       ab  5       14
1       cd  1       5
1       cd  2       6
1       cd  3       9
1       cd  4       12
2       ef  1       4
2       ef  2       3
2       gh  1       16
2       gh  2       12
2       gh  3       21

I would like to select from this table the maximum version number for every oid and get the userid and the number of the row.

When I tried this, unfortunately I've got the whole table back:

SELECT MAX(version), oid, userid, number_of_objects
FROM table
GROUP BY oid, userid, number_of_objects
LIMIT 10;

But the real result, what I'm looking for would be:

userid  oid MAX(version)    number_of_objects
1       ab  5               14
1       cd  4               12
2       ef  2               3
2       gh  3               21

Somehow distinct on doesn't work either, it says:

SELECT DISTINCT ON is not supported

Do you have any idea?


UPDATE: in the meantime I came up with this workaround, but I feel like this is not the smartest solution. It's also very slow. But it works at least. Just in case:

SELECT * FROM table,
   (SELECT MAX(version) as maxversion, oid, userid
    FROM table
    GROUP BY oid, userid
    ) as maxtable
    WHERE  table.oid = maxtable.oid
   AND table.userid = maxtable.userid
   AND table.version = maxtable.version
LIMIT 100;

Do you have any better solution?

回答1:

If redshift does have window functions, you might try this:

SELECT * 
FROM (
  select oid, 
         userid, 
         version,
         max(version) over (partition by oid, userid) as max_version, 
  from the_table
) t
where version = max_version;

I would expect that to be faster than a self join with a group by.

Another option would be to use the row_number() function:

SELECT * 
FROM (
  select oid, 
         userid, 
         version,
         row_number() over (partition by oid, userid order by version desc) as rn, 
  from the_table
) t
where rn = 1;

It's more a matter of personal taste which one to use. Performance wise I wouldn't expect a difference.



回答2:

select      distinct
            first_value(userid) over(
                  partition by oid 
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as userid
            , oid
            , first_value(version) over(
                  partition by oid
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as max_version
            , first_value(number_of_objects) over(
                  partition by oid
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as number_of_objects

from        table
order by    oid;

AWS Redshift Documentation first_value

Don't forget nulls last in the ordering if version is nullable.



回答3:

Long story short: Horses for courses.

The author's approach should be faster on smaller tables and to extract sample data but the window methods will be more consistent in performance and faster on the whole table.

Below are a few explain results I made on table with 17 columns, 184 121 798 rows and 12 809 740 unique ids (an average of 14 versions per id, but can go up to 40).

Quick summary:

Tomi's approach: cost=5983958.76..67801689853856.94 (6*10^6 for first row, and 7*10^13 for whole table)

@a_horse_with_no_name approach: cost=1000027117538.39..1000031720583.59 (10^12 for any query)

@Merlin: almost exact to above method.

Original approach

explain
SELECT * FROM table t,
(SELECT MAX(version) as maxversion, id
 FROM table
 GROUP BY id
) as maxtable
WHERE  t.id = maxtable.id
       AND t.version = maxtable.maxversion;
XN Hash Join DS_DIST_NONE  (cost=5983958.76..67801689853856.94 rows=63811541 width=590)
  Hash Cond: ((("outer".id)::text = ("inner".id)::text) AND ("outer".version = "inner".maxversion))
  ->  XN Seq Scan on equipment_visits ev  (cost=0.00..1841218.08 rows=184121808 width=418)
  ->  XN Hash  (cost=5063349.72..5063349.72 rows=184121808 width=172)
        ->  XN Subquery Scan maxtable  (cost=2761827.12..5063349.72 rows=184121808 width=172)
              ->  XN HashAggregate  (cost=2761827.12..3222131.64 rows=184121808 width=44)
                    ->  XN Seq Scan on equipment_visits  (cost=0.00..1841218.08 rows=184121808 width=44)

So, cost of the first and all rows are 5983958.76 (6*10^6) and 67801689853856.94 (7*10^13).

a_horse_with_no_name's approach

Both solutions provided by the @a_horse_with_no_name have almost exact plan so I'll paste only one of them

explain
SELECT * 
FROM (
  select *,
         row_number() over (partition by id order by version desc) as rn
  from table
)
where rn = 1;

gives

  Filter: (rn = 1)
  ->  XN Window  (cost=1000027117538.39..1000029419060.99 rows=184121808 width=44)
        Partition: id
        Order: version
        ->  XN Sort  (cost=1000027117538.39..1000027577842.91 rows=184121808 width=44)
              Sort Key: id, version
              ->  XN Seq Scan on table  (cost=0.00..1841218.08 rows=184121808 width=44)

Merlin's approach

Solution provided by @Merlin seems to be incomplete, as it doesn't return all values for the latest version, but it has similar performance to second option

explain
select      distinct
              id
            , first_value(version) over(
                  partition by id
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as max_version
            , first_value(additional_col) over(
                  partition by id
                  order by version desc
                  rows between unbounded preceding and unbounded following
                  ) as additional_col

from        table t;

gives

XN Unique  (cost=1000027117538.39..1000032180888.11 rows=184121808 width=84)
  ->  XN Window  (cost=1000027117538.39..1000030799974.55 rows=184121808 width=84)
        Partition: id
        Order: version
        ->  XN Sort  (cost=1000027117538.39..1000027577842.91 rows=184121808 width=84)
              Sort Key: id, version
              ->  XN Seq Scan on table  (cost=0.00..1841218.08 rows=184121808 width=84)