SQL Query - Delete duplicates if more than 3 dups?

Posted 2020-02-12 06:45

Does anyone have an elegant SQL statement to delete duplicate records from a table, but only if there are more than x duplicates? So it allows up to 2 or 3 duplicates, but that's it.

Currently I have a delete statement that does the following:

delete table
from table t
left outer join (
 select max(id) as rowid, dupcol1, dupcol2
 from table
 group by dupcol1, dupcol2
) as keeprows on t.id=keeprows.rowid
where keeprows.rowid is null

This works great. But now what I'd like to do is only delete those rows if they have more than say 2 duplicates.

Thanks

4 answers
混吃等死
Reply #2 · 2020-02-12 07:02

HAVING is your friend

select dupcol1, dupcol2, count(*) as cnt from table group by dupcol1, dupcol2 having count(*) > 2
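
To see which groups a HAVING filter picks out, here is a minimal sketch using Python's built-in sqlite3 module. The table name t, the sample rows and the helper oversized_groups are made up for illustration; the grouping columns dupcol1 and dupcol2 are the ones from the question. It only finds the oversized groups, it does not delete anything.

```python
import sqlite3


def oversized_groups(rows, limit=2):
    """Return (dupcol1, dupcol2, count) for every group with more than `limit` rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: id is unique, (dupcol1, dupcol2) defines a duplicate.
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, dupcol1 TEXT, dupcol2 TEXT)"
    )
    conn.executemany("INSERT INTO t (dupcol1, dupcol2) VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT dupcol1, dupcol2, COUNT(*) AS cnt FROM t "
        "GROUP BY dupcol1, dupcol2 "
        "HAVING COUNT(*) > ? "
        "ORDER BY dupcol1, dupcol2",
        (limit,),
    ).fetchall()
    conn.close()
    return result


# Four copies of (a, x), two copies of (b, y), one copy of (c, z).
sample = [("a", "x")] * 4 + [("b", "y")] * 2 + [("c", "z")]
print(oversized_groups(sample))
```

Grouping by the columns that define a duplicate matters here: grouping by a unique id would put every row in its own group, so no group would ever exceed the limit.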

等我变得足够好
Reply #3 · 2020-02-12 07:10

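You can try the following query. It relies on Oracle's rowid and removes only one row (the one with the lowest rowid) from each group that has more than 3 rows, so it has to be re-run until it deletes nothing: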

DELETE FROM table t1 
WHERE rowid IN
(SELECT MIN(rowid) FROM table t2 GROUP BY t2.id,t2.name HAVING COUNT(rowid)>3);
你好瞎i
Reply #4 · 2020-02-12 07:14

Quite late, but the simplest solution could be as follows. Suppose we have a table emp_dept(empid, deptid) that contains duplicate rows. Here I have used @Count as a variable holding the number of rows allowed per group, e.g. if 2 duplicates are allowed then @Count = 2. On an Oracle database:

  delete from emp_dept where @Count <= ( select count(1) from emp_dept i where i.empid = emp_dept.empid and i.deptid = emp_dept.deptid and i.rowid < emp_dept.rowid ) 

On SQL Server, or any database that has no rowid-like feature, we need to add an identity column just to identify each row. Say we have added nid as an identity column to the table:

alter table emp_dept add nid int identity(1,1) -- to add identity column

Now the query to delete the duplicates can be written as:

  delete from emp_dept where @Count <= ( select count(1) from emp_dept i where i.empid = emp_dept.empid and i.deptid = emp_dept.deptid and i.nid < emp_dept.nid ) 

The concept is: delete every row for which at least n other rows exist with the same key values and a smaller rowid or identity value. So within a group of duplicates, the rows with the highest rowid or identity values are deleted and the n rows with the lowest values are kept. For a row that has no duplicates, the subquery finds no row with a lower rowid, so that row is not deleted.
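
The correlated-count idea can be tried out with a minimal sketch on SQLite through Python's built-in sqlite3 module. This is an illustration under assumptions, not the Oracle or SQL Server statement itself: nid is declared as INTEGER PRIMARY KEY to play the role of the identity column, a bound parameter replaces @Count, and the helper name delete_excess and the sample rows are made up.

```python
import sqlite3


def delete_excess(rows, allowed=2):
    """Keep at most `allowed` rows per (empid, deptid); return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    # nid stands in for the identity column: it increases with insertion order.
    conn.execute(
        "CREATE TABLE emp_dept (nid INTEGER PRIMARY KEY, empid INTEGER, deptid INTEGER)"
    )
    conn.executemany("INSERT INTO emp_dept (empid, deptid) VALUES (?, ?)", rows)
    # Delete a row when at least `allowed` rows with the same key have a smaller nid.
    conn.execute(
        """
        DELETE FROM emp_dept
        WHERE ? <= (
            SELECT COUNT(1)
            FROM emp_dept AS i
            WHERE i.empid = emp_dept.empid
              AND i.deptid = emp_dept.deptid
              AND i.nid < emp_dept.nid
        )
        """,
        (allowed,),
    )
    remaining = conn.execute(
        "SELECT nid, empid, deptid FROM emp_dept ORDER BY nid"
    ).fetchall()
    conn.close()
    return remaining


# Four copies of (1, 10), two copies of (2, 20), one copy of (3, 30).
sample = [(1, 10)] * 4 + [(2, 20)] * 2 + [(3, 30)]
print(delete_excess(sample, allowed=2))
```

With allowed=2 only the third and fourth copies of (1, 10) are removed; the pair (2, 20) and the single (3, 30) are untouched because neither has two earlier rows with the same key.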

可以哭但决不认输i
Reply #5 · 2020-02-12 07:27
with cte as (
  select row_number() over (partition by dupcol1, dupcol2 order by ID) as rn
     from table)
delete from cte
   where rn > 2; -- or >3 etc

The query manufactures a 'row number' for each record, partitioned by (dupcol1, dupcol2) and ordered by ID. In effect this row number counts the 'duplicates' that share the same dupcol1 and dupcol2 and assigns them the numbers 1, 2, 3, .. N, ordered by ID. If you want to keep just 2 'duplicates', then you need to delete those that were assigned the numbers 3, 4, .. N, and that is the part taken care of by the DELETE .. WHERE rn > 2;

Using this method you can change the ORDER BY to suit your preferred order (e.g. ORDER BY ID DESC), so that the LATEST has rn=1, the next to latest has rn=2, and so on. The rest stays the same: the DELETE will remove only the oldest ones, as they have the highest row numbers.
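
The effect of the ORDER BY direction can be checked with a minimal sketch on SQLite through Python's built-in sqlite3 module. Two things differ from the statement above and are assumptions of this sketch: SQLite cannot delete through a CTE, so the CTE also selects id and the DELETE targets the base table with id IN (...), and window functions need SQLite 3.25 or later. The table name t, the helper keep_n and the sample rows are made up for illustration.

```python
import sqlite3


def keep_n(rows, keep=2, newest_first=False):
    """Keep `keep` rows per (dupcol1, dupcol2); return the ids that survive."""
    # The direction comes from a boolean, so formatting it into the SQL is safe.
    order = "DESC" if newest_first else "ASC"
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, dupcol1 TEXT, dupcol2 TEXT)"
    )
    conn.executemany("INSERT INTO t (dupcol1, dupcol2) VALUES (?, ?)", rows)
    # Number the rows inside each duplicate group, then delete the overflow.
    conn.execute(
        f"""
        WITH cte AS (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY dupcol1, dupcol2 ORDER BY id {order}
                   ) AS rn
            FROM t
        )
        DELETE FROM t WHERE id IN (SELECT id FROM cte WHERE rn > ?)
        """,
        (keep,),
    )
    ids = [row[0] for row in conn.execute("SELECT id FROM t ORDER BY id")]
    conn.close()
    return ids


# ids 1-4 are copies of (a, x); id 5 is a single (b, y).
sample = [("a", "x")] * 4 + [("b", "y")]
print(keep_n(sample))                     # oldest two of each group survive
print(keep_n(sample, newest_first=True))  # newest two of each group survive
```

Ordering ascending keeps ids 1 and 2 of the duplicated group, while ordering descending keeps ids 3 and 4; the non-duplicated row survives either way.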

Unlike this closely related question, as the condition becomes more complex, using CTEs and row_number() becomes simpler. Performance may still be a problem if no suitable index exists.
