What is the best way to remove duplicate rows from a fairly large SQL Server
table (i.e. 300,000+ rows)?
The rows, of course, will not be perfect duplicates because of the existence of the RowID
identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
I thought I'd share my solution since it works under special circumstances. In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
PS: when working on things like this I always use a transaction; this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...
This query showed very good performance for me: it deleted 1M rows in a little over 30 seconds from a table of 2M rows (50% duplicates).
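The query itself did not survive above, so here is a minimal sketch of the kind of single DELETE described, wrapped in a transaction as the PS suggests. The assumption that duplicates are defined by Col1, Col2 and Col3 (keeping the lowest RowID in each group) is mine:

begin transaction

-- keep the lowest RowID in each duplicate group and delete the rest
-- (the Col1/Col2/Col3 column list is an assumption)
delete t
from MyTable as t
join (
    select min(RowID) as KeepID, Col1, Col2, Col3
    from MyTable
    group by Col1, Col2, Col3
    having count(*) > 1
) as k
  on t.Col1 = k.Col1
 and t.Col2 = k.Col2
 and t.Col3 = k.Col3
 and t.RowID <> k.KeepID

-- inspect the affected row count here, then commit (or rollback while testing)
commit transaction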
Use this
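(The statement this answer refers to is missing; a plausible sketch that keeps the lowest RowID per duplicate group, again assuming duplicates are defined by Col1, Col2 and Col3:)

delete from MyTable
where RowID not in (
    select min(RowID)      -- keep one survivor per group
    from MyTable
    group by Col1, Col2, Col3
)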
Another possible way of doing this is
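(The code block is missing here; a sketch consistent with the ORDER BY (SELECT 0) discussion that follows, with the Col1/Col2/Col3 partition list assumed:)

;with cte as
(
    -- number the rows within each duplicate group; which row gets RN = 1 is arbitrary
    select row_number() over (partition by Col1, Col2, Col3
                              order by (select 0)) as RN
    from MyTable
)
delete from cte      -- deleting from the CTE deletes from the underlying table
where RN > 1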
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie. To preserve the latest one in RowID order, for example, you could use ORDER BY RowID DESC.
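That variant changes only the window's ORDER BY (same assumed column list):

row_number() over (partition by Col1, Col2, Col3 order by RowID desc) as RN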
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case, however. One place where the GROUP BY solution might be preferred is in situations where a hash aggregate would be chosen in preference to a stream aggregate. The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be no useful index on the partitioning columns, and relatively few groups with relatively many duplicates in each group.
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table, then TRUNCATE-ing the original and copying them back, to minimise logging compared to deleting a very high proportion of the rows.
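A minimal sketch of that pattern, with an illustrative staging-table name and the same assumed column list; IDENTITY_INSERT is used so the surviving RowID values can be copied back unchanged:

-- keep one row per duplicate group in a staging table (name is illustrative)
select min(RowID) as RowID, Col1, Col2, Col3
into MyTable_Keep
from MyTable
group by Col1, Col2, Col3

-- minimally logged empty-out; fails if any foreign key references MyTable
truncate table MyTable

-- copy the kept rows back, preserving the original RowID values
set identity_insert MyTable on

insert into MyTable (RowID, Col1, Col2, Col3)
select RowID, Col1, Col2, Col3
from MyTable_Keep

set identity_insert MyTable off

drop table MyTable_Keep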