I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.
What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?
Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction. Optionally, you can drop the temporary table at the end of the transaction automatically with `ON COMMIT DROP`. See below.

This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain `DELETE`.

You mentioned millions of rows. To make the operation fast you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:
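For example, assuming the table is named `tbl` (a placeholder name used throughout the sketches below):

```sql
SELECT pg_size_pretty(pg_relation_size('tbl'));
```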
Set `temp_buffers` accordingly. Round up generously, because the in-memory representation needs a bit more RAM.

This method can be superior to creating a new table if depending objects exist: views, indexes, foreign keys, or other objects referencing the table.
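A minimal sketch of the whole operation, assuming rows count as duplicates when all columns match:

```sql
SET temp_buffers = '1000MB';   -- example value; must be set before temp buffers are used in the session

BEGIN;

CREATE TEMP TABLE t_tmp ON COMMIT DROP AS
SELECT DISTINCT * FROM tbl;    -- keep one copy of each fully distinct row

TRUNCATE tbl;                  -- start over with an empty table

INSERT INTO tbl
SELECT * FROM t_tmp;           -- re-insert the de-duplicated rows

COMMIT;                        -- the temp table is dropped automatically here
```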
`TRUNCATE` makes you begin with a clean slate anyway (new file in the background) and is much faster than `DELETE FROM tbl` with big tables (`DELETE` can actually be faster with small tables).

For big tables, it is regularly faster to drop indexes and foreign keys, refill the table, and recreate these objects. As far as FK constraints are concerned, you have to be certain the new data is valid, of course, or you'll run into an exception when trying to create the FK.
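A sketch of that pattern, with hypothetical names `tbl_foo_idx` and `tbl_bar_fkey` for the depending objects:

```sql
DROP INDEX tbl_foo_idx;                        -- hypothetical index
ALTER TABLE tbl DROP CONSTRAINT tbl_bar_fkey;  -- hypothetical foreign key

-- ... refill the table, e.g. with the TRUNCATE / re-insert transaction above ...

CREATE INDEX tbl_foo_idx ON tbl (foo);
ALTER TABLE tbl ADD CONSTRAINT tbl_bar_fkey
   FOREIGN KEY (bar) REFERENCES other_tbl (id);  -- fails if any remaining row violates it
```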
Note that `TRUNCATE` requires more aggressive locking than `DELETE`. This may be an issue for tables with heavy, concurrent load.

If `TRUNCATE` is not an option, or generally for small to medium tables, there is a similar technique with a data-modifying CTE (Postgres 9.1+):
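A sketch of that technique, under the same assumption that whole rows define duplicates:

```sql
WITH del AS (
   DELETE FROM tbl RETURNING *  -- empty the table, funneling the old rows into the CTE
   )
INSERT INTO tbl
SELECT DISTINCT * FROM del;     -- re-insert distinct rows in the same statement
```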
Slower for big tables, because `TRUNCATE` is faster there. But it may be faster (and simpler!) for small tables.

If you have no depending objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.
For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible troubles / overhead with depending objects.
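A sketch of the new-table route, with `tbl_new` as a hypothetical name:

```sql
BEGIN;

CREATE TABLE tbl_new AS
SELECT DISTINCT * FROM tbl;   -- only unique rows survive

DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

COMMIT;
-- indexes, constraints, defaults etc. are not copied and have to be recreated
```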
Generalized query to delete duplicates:
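A sketch of such a query, with hypothetical `column1, column2` standing in for the columns of the intended unique constraint:

```sql
DELETE FROM tbl
WHERE  ctid NOT IN (
   SELECT min(ctid)            -- keep one arbitrary row per group
   FROM   tbl
   GROUP  BY column1, column2  -- the columns that should be unique
   );
```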
The column `ctid` is a special column available for every table, but not visible unless specifically mentioned. The `ctid` column value is considered unique for every row in a table.

First, you need to decide on which of your "duplicates" you will keep. If all columns are equal, OK, you can delete any of them... But perhaps you want to keep only the most recent, or some other criterion?
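If, say, a hypothetical `updated_at` column decides which duplicate survives, `DISTINCT ON` is one way to express that in a rebuild:

```sql
CREATE TABLE tbl_new AS
SELECT DISTINCT ON (column1, column2) *
FROM   tbl
ORDER  BY column1, column2, updated_at DESC;  -- keep the most recent row per (column1, column2)
```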
The fastest way depends on your answer to the question above, and also on the percentage of duplicates in the table. If you throw away 50% of your rows, you're better off doing `CREATE TABLE ... AS SELECT DISTINCT ... FROM ...;`, and if you delete 1% of the rows, using `DELETE` is better.

Also, for maintenance operations like this, it's generally good to set `work_mem` to a good chunk of your RAM: run `EXPLAIN`, check the number N of sorts/hashes, and set `work_mem` to your RAM / 2 / N. Use lots of RAM; it's good for speed. As long as you only have one concurrent connection...
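For example, with a hypothetical 8 GB of RAM and a plan showing two sorts:

```sql
SET work_mem = '2GB';  -- 8 GB / 2 / 2 sorts, for this session only
```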