I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:
WITH TempEmp AS
(
SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount
FROM mytable
)
-- Now Delete Duplicate Records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;
But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE
after the WITH
clause?
I like @erwin-brandstetter 's solution, but wanted to show a solution with the
USING
keyword:If you want to review the records before deleting them, then simply replace
DELETE
withSELECT *
andUSING
with a comma,
, i.e.Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a
NOT IN (...)
clause as those generate a lot of rows in the subquery.If you rewrite the query to use
IN (...)
then it performs similarly to the solution presented here, but the SQL code becomes much less concise.Update 2: If you have
NULL
values in one of the key columns (which you really shouldn't IMO), then you can useCOALESCE()
in the condition for that column, e.g.The valid syntax is specified at http://www.postgresql.org/docs/current/static/sql-delete.html
I would ALTER your table to add a unique auto-incrementing primary key id so that you can run a query like the following which will keep the first of each set of duplicates (ie the one with the lowest id). Note that adding the key is a bit more complicated in Postgres than some other DBs.
If you want to keep one row out of duplicate rows in the table.
This will create a table which you can copy.
Before copying table please delete the column 'row_n'
Here is what I came up with, using a
group by
It deletes the duplicates, preserving the oldest record that has duplicates.
In a perfect world, every table has a unique identifier of some sort.
In the absence of any unique column (or combination thereof), use the
ctid
column:The above query is short, conveniently listing column names only once.
NOT IN (SELECT ...)
is a tricky query style when NULL values can be involved, but the system columnctid
is never NULL. See:Using
EXISTS
as demonstrated by @Gordon is typically faster. So is a self-join with theUSING
clause like @isapir added later. Both should result in the same query plan.But note an important difference: These other queries treat
NULL
values as not equal, whileGROUP BY
(orDISTINCT
orDISTINCT ON ()
) treats NULL values as equal. Does not matter if key columns are definedNOT NULL
. Else, depending on your definition of "duplicate", you'll need one or the other approach. Or useIS NOT DISTINCT FROM
in comparison of values (which may not be able to use some indexes).Disclaimer:
ctid
is an internal implementation detail of Postgres, it's not in the SQL standard and can be changed between major versions without warning (even if that's very unlikely). Its values can change between commands due to background processes or concurrent write operations (but not within the same command).Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
How to use the physical location of rows (ROWID) in a DELETE statement
Aside:
The target of a
DELETE
statement cannot be the CTE, only the underlying table. That's a spillover from SQL Server - as is your whole approach.If you want a unique identifier for every row, you could just add one (a serial, or a guid), and treat it like a surrogate key.