I need to find duplicates in a table. In MySQL I simply write:
SELECT *,count(id) count FROM `MY_TABLE`
GROUP BY SOME_COLUMN ORDER BY count DESC
This query nicely:
- Finds duplicates based on SOME_COLUMN, giving its repetition count.
- Sorts in desc order of repetition, which is useful to quickly scan major dups.
- Chooses a random value for all remaining columns, giving me an idea of values in those columns.
Similar query in Postgres greets me with an error:
column "MY_TABLE.SOME_COLUMN" must appear in the GROUP BY clause or be used in an aggregate function
What is the Postgres equivalent of this query?
PS: I know that MySQL behaviour deviates from SQL standards.
Back-ticks are a non-standard MySQL thing. Use the canonical double quotes to quote identifiers (possible in MySQL, too). That is, if your table in fact is named
"MY_TABLE"
(all upper case). If you (more wisely) named itmy_table
(all lower case), then you can remove the double quotes or use lower case.Also, I use
ct
instead ofcount
as alias, because it is bad practice to use function names as identifiers.Simple case
This would work with PostgreSQL 9.1:
It requires primary key column(s) in the
GROUP BY
clause. The results are identical to a MySQL query, butct
would always be 1 (or 0 ifid IS NULL
) - useless to find duplicates.Group by other than primary key columns
If you want to group by other column(s), things get more complicated. This query mimics the behavior of your MySQL query - and you can use
*
.This works because
DISTINCT ON
(PostgreSQL specific), likeDISTINCT
(SQL-Standard), are applied after the window functioncount(*) OVER (...)
. Window functions (with theOVER
clause) require PostgreSQL 8.4 or later and are not available in MySQL.Works with any table, regardless of primary or unique constraints.
The
1
inDISTINCT ON
andORDER BY
is just shorthand to refer to the ordinal number of the item in theSELECT
list.SQL Fiddle to demonstrate both side by side.
More details in this closely related answer:
count(*)
vs.count(id)
If you are looking for duplicates, you are better off with
count(*)
than withcount(id)
. There is a subtle difference ifid
can beNULL
, becauseNULL
values are not counted - whilecount(*)
counts all rows. Ifid
is definedNOT NULL
, results are the same, butcount(*)
is generally more appropriate (and slightly faster, too).Here is a self-joined CTE, which allows you to use
select *
. key0 is the intended unique key, {key1,key2} are the additional key elements needed to address the currently non-unique rows. Use at your own risk, YMMV.BTW: to get the intended behaviour for the OP, the
HAVING COUNT(*) > 1
clause should be omitted.mysql allows
group by
to omit non-aggregated selected columns from thegroup by
list, which it executes by returning the first row found for each unique combination of grouped by columns. This is non-standard SQL behaviour.postgres on the other hand is SQL standard compliant.
There is no equivalent query in postgres.
Here's another approach, uses DISTINCT ON:
Data source:
Output:
Live test: http://www.sqlfiddle.com/#!1/e2509/1
DISTINCT ON documentation: http://www.postgresonline.com/journal/archives/4-Using-Distinct-ON-to-return-newest-order-for-each-customer.html