How do I delete duplicate rows in a Postgres 9 table? The rows are complete duplicates on every field, and there is no individual field that could be used as a unique key, so I can't just GROUP BY columns and use a NOT IN statement.
I'm looking for a single SQL statement, not a solution that requires me to create a temporary table and insert records into that. I know how to do that, but it requires more work to fit into my automated process.
Table definition:
jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
Column | Type | Modifiers
------------+---------+-----------
label | text |
release_id | integer |
catno | text |
Indexes:
"releases_labels_catno_idx" btree (catno)
"releases_labels_name_idx" btree (label)
Foreign-key constraints:
"foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)
Sample data:
jthinksearch=> select * from releases_labels where release_id=6155;
label | release_id | catno
--------------+------------+------------
Warp Records | 6155 | WAP 39 CDR
Warp Records | 6155 | WAP 39 CDR
If you can afford to rewrite the whole table, this is probably the simplest approach:
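For instance, a sketch of that approach with a data-modifying CTE (my reconstruction of the idea, not necessarily the answer's original statement):

    WITH deleted AS (
        DELETE FROM discogs.releases_labels
        RETURNING *
    )
    INSERT INTO discogs.releases_labels
    SELECT DISTINCT * FROM deleted;

In one statement this removes every row and re-inserts a single copy of each distinct one.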
If you need to specifically target the duplicated records, you can make use of the internal ctid field, which uniquely identifies a row:
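A sketch of that technique, assuming you group by all three columns (not necessarily the answer's original statement):

    DELETE FROM discogs.releases_labels
    WHERE ctid NOT IN (
        SELECT min(ctid)                  -- keep one physical copy per group
        FROM   discogs.releases_labels
        GROUP  BY label, release_id, catno
    );

GROUP BY treats NULL values as equal, so rows that match on NULL fields still land in the same group.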
Be very careful with ctid; it changes over time. But you can rely on it staying the same within the scope of a single statement.
Since you also need to avoid duplicates in the future, you could add a surrogate key and a unique constraint while dedupping:
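A sketch of that combined approach (the column name id and the constraint name are illustrative assumptions; note that a UNIQUE constraint treats NULLs as distinct, so it will not catch future duplicates that match on NULL fields):

    ALTER TABLE discogs.releases_labels ADD COLUMN id serial;

    DELETE FROM discogs.releases_labels
    WHERE id NOT IN (
        SELECT min(id)
        FROM   discogs.releases_labels
        GROUP  BY label, release_id, catno
    );

    ALTER TABLE discogs.releases_labels
        ADD CONSTRAINT releases_labels_uniq UNIQUE (label, release_id, catno);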
Single SQL statement
Here is a solution that deletes duplicates in place:
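The statement itself is not preserved in this copy; here is a reconstruction consistent with the explanation below (the aliases r and r1 are the ones it refers to):

    DELETE FROM releases_labels r
    WHERE EXISTS (
        SELECT 1
        FROM   releases_labels r1
        WHERE  r1 = r            -- all visible columns are equal
        AND    r1.ctid < r.ctid  -- an earlier physical copy exists
    );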
Since there is no unique key, I am (ab)using the tuple ID ctid for the purpose. The physically first row survives in each set of dupes.
ctid is a system column that is not part of the associated row type, so when referencing the whole row with table aliases in the expression r1 = r, only visible columns are compared (not the ctid or others). That's why the whole rows can be equal while one ctid is still smaller than the other.
With only a few duplicates, this is also the fastest of all solutions. With lots of duplicates, other solutions are faster.
Then I suggest:
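The suggested statement is not preserved in this copy; a plausible sketch is the grouped-aggregate variant, which finds one survivor per group in a single pass instead of running a correlated probe for every row:

    DELETE FROM releases_labels
    WHERE ctid NOT IN (
        SELECT min(ctid)
        FROM   releases_labels
        GROUP  BY label, release_id, catno  -- list all columns
    );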
Why does it work with NULL values?
This is somewhat surprising. The reason is explained in the chapter Composite Type Comparison in the manual: the SQL standard requires row-wise comparison to return NULL when the result depends on comparing two NULLs, but PostgreSQL applies that rule only to row constructor comparisons. In other contexts where two composite-type values are compared, as with the table aliases in r1 = r, two NULL field values are considered equal.
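An illustrative pair of queries (my own example, not from the answer) showing the difference between comparing composite values and comparing row constructors:

    -- Whole-row (composite) comparison: NULL fields count as equal
    SELECT a = b AS records_equal
    FROM  (SELECT 1 AS x, NULL::text AS y) a,
          (SELECT 1 AS x, NULL::text AS y) b;
    -- returns true

    -- Row constructor comparison follows the SQL standard instead
    SELECT ROW(1, NULL::text) = ROW(1, NULL::text) AS ctors_equal;
    -- returns NULL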
Alternatives with second table
I removed that section, because the solution with a data-modifying CTE provided by @Nick is better.
As you have no primary key, there is no easy way to distinguish one duplicated line from any other one. That's one of the reasons why it is highly recommended that any table have a primary key (*).
So you are left with only 2 solutions:
- copy the distinct rows to a second table, truncate the original, and copy them back, which is what you wanted to avoid (**), or
- use procedural SQL and a cursor, either from a procedural language such as Python or [put here your preferred language] or with PL/pgSQL. Something like the sketch below (beware, untested):
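One possible shape for such a function (a sketch under the same untested caveat; the function name and loop details are my assumptions):

    CREATE OR REPLACE FUNCTION remove_duplicates() RETURNS integer AS $$
    DECLARE
        rec   record;
        prev  record;
        first boolean := true;
        n     integer := 0;
    BEGIN
        -- Scan in sorted order so identical rows arrive adjacent to each other
        FOR rec IN
            SELECT t.ctid, t.*
            FROM   releases_labels t
            ORDER  BY label, release_id, catno
        LOOP
            IF NOT first
               AND (rec.label, rec.release_id, rec.catno)
                   IS NOT DISTINCT FROM (prev.label, prev.release_id, prev.catno)
            THEN
                -- Duplicate of the previous row: delete this physical copy
                DELETE FROM releases_labels WHERE ctid = rec.ctid;
                n := n + 1;
            ELSE
                prev := rec;
            END IF;
            first := false;
        END LOOP;
        RETURN n;
    END;
    $$ LANGUAGE plpgsql;

Call it with SELECT remove_duplicates();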
This should delete the duplicate lines and return the number of lines actually deleted. It is not necessarily the most efficient way, but you only touch the rows that need to be deleted, so you will not have to lock the whole table.
(*) Fortunately, PostgreSQL offers the ctid pseudo-column that you can use as a key. If your table contains an oid column, you can also use it, as it will never change.
(**) PostgreSQL's WITH allows you to do that in a single SQL statement.
These two points come from Nick Barnes' answer.
You can try something like this:
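For example, a self-join on ctid (a sketch; the original snippet was not preserved in this copy):

    DELETE FROM releases_labels r1
    USING  releases_labels r2
    WHERE  r1.ctid > r2.ctid                           -- keep the earliest copy
    AND    r1.label      IS NOT DISTINCT FROM r2.label
    AND    r1.release_id IS NOT DISTINCT FROM r2.release_id
    AND    r1.catno      IS NOT DISTINCT FROM r2.catno;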