I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.
What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?
If you have only one or a few duplicated entries, and they are indeed duplicated (that is, they appear twice), you can use the "hidden" `ctid` column, as proposed above, together with `LIMIT`; this will delete only the first of the selected rows.
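A minimal sketch, assuming the duplicated value lives in column `key` of table `tablename` (both placeholders):

```sql
-- Delete exactly one of the two physical copies; LIMIT 1 picks
-- an arbitrary one of the matching rows.
DELETE FROM tablename
WHERE ctid = (
    SELECT ctid
    FROM tablename
    WHERE key = 'duplicated-value'   -- the value that appears twice
    LIMIT 1
);
```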
I'm working with PostgreSQL 8.4. When I ran the proposed code, I found that it was not actually removing the duplicates. Running some tests, I found that adding `DISTINCT ON (duplicate_column_name)` and `ORDER BY duplicate_column_name` did the trick. I'm no SQL guru; I found this in the PostgreSQL 8.4 SELECT...DISTINCT documentation.
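A sketch of the resulting statement (the commenter's exact query isn't shown, and table and column names here are placeholders):

```sql
-- Keep one physical row (ctid) per value of duplicate_column_name
-- and delete every other copy.
DELETE FROM tablename
WHERE ctid NOT IN (
    SELECT DISTINCT ON (duplicate_column_name) ctid
    FROM tablename
    ORDER BY duplicate_column_name
);
```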
You can use `oid` or `ctid`, which are normally "non-visible" columns in the table. For example, you could do something along these lines.
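A sketch using `oid`, which requires the table to have OIDs (older PostgreSQL supports only equality comparison on `ctid`, so there is no `min(ctid)`); table and column names are placeholders:

```sql
-- Keep the row with the smallest oid in each duplicate group,
-- where (column1, column2) is the key that should become unique.
DELETE FROM tablename
WHERE oid NOT IN (
    SELECT min(oid)
    FROM tablename
    GROUP BY column1, column2
);
```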
From an old postgresql.org mailing list:
The walkthrough builds a small test table: it inserts some unique values, then some duplicate values, then one more double duplicate; after that it selects the duplicate rows and finally deletes them, as in the sketch below.
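A sketch of that sequence, using `oid` to tell the copies apart (so the table needs OIDs); the table `test` and its values are illustrative:

```sql
CREATE TABLE test (a text, b text) WITH OIDS;

-- Unique values
INSERT INTO test VALUES ('x', 'y');
INSERT INTO test VALUES ('x', 'z');

-- Duplicate values
INSERT INTO test VALUES ('y', 'y');
INSERT INTO test VALUES ('y', 'y');

-- One more double duplicate
INSERT INTO test VALUES ('y', 'y');

-- Select duplicate rows: every row that has a twin with a smaller oid
SELECT oid, a, b
FROM test
WHERE EXISTS (
    SELECT 1 FROM test t
    WHERE t.a = test.a
      AND t.b = test.b
      AND t.oid < test.oid
);

-- Delete duplicate rows (same predicate), keeping the copy
-- with the smallest oid in each group
DELETE FROM test
WHERE EXISTS (
    SELECT 1 FROM test t
    WHERE t.a = test.a
      AND t.b = test.b
      AND t.oid < test.oid
);
```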
Note: PostgreSQL doesn't support aliases on the table mentioned in the `FROM` clause of a DELETE, which is why the DELETE above refers to the target table by its bare name and aliases only the subquery's scan.

Some of these approaches seem a little complicated, and I generally do this as:
Given table `table`, want to unique it on (field1, field2), keeping the row with the max field3:
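A sketch of that delete, with `tbl` standing in for `table` (a reserved word) and `field1`/`field2`/`field3` as placeholders:

```sql
-- Any row that has a twin on (field1, field2) with a higher field3
-- gets deleted; the max-field3 row in each group survives.
DELETE FROM tbl
USING tbl other
WHERE tbl.field1 = other.field1
  AND tbl.field2 = other.field2
  AND tbl.field3 < other.field3;
```

(If two rows tie on the maximum field3, both survive, so this assumes field3 distinguishes the duplicates.)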
For example, I have a table, `user_accounts`, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).
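A sketch of that delete, assuming the columns are named `email` and `id`:

```sql
-- For each email, delete every row that has a duplicate with a
-- higher id; only the most recently created row (max id) survives.
DELETE FROM user_accounts
USING user_accounts ua2
WHERE user_accounts.email = ua2.email
  AND user_accounts.id < ua2.id;
```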
`USING` is not standard SQL (it is a PostgreSQL extension, albeit a very useful one), but the original question specifically mentions PostgreSQL.