I know that this topic came up many times before here but none of the suggested solutions worked for my dataset because my laptop stopped calculating due to memory issues or full storage.
My table looks like the following and has 108
Mio rows:
Col1 |Col2 | Col3 |Col4 |SICComb | NameComb
Case New |3523 | Alexander |6799 |67993523| AlexanderCase New
Case New |3523 | Undisclosed |6799 |67993523| Case NewUndisclosed
Undisclosed|6799 | Case New |3523 |67993523| Case NewUndisclosed
Case New |3523 | Undisclosed |6799 |67993523| Case NewUndisclosed
SmartCard |3674 | NEC |7373 |73733674| NECSmartCard
SmartCard |3674 | Virtual NetComm|7373 |73733674| SmartCardVirtual NetComm
SmartCard |3674 | NEC |7373 |73733674| NECSmartCard
The unique columns are SICComb
and NameComb
. I tried to add a primary key with:
ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)
but the integers are filling up more than 30
GB of my storage just in a new minutes.
Which would be the fastest and most efficient method to delete the duplicates from the table?
If you're using SQL Server, you can use delete from common table expression:
Here all rows will be numbered, you get own sequence for each unique combination of
SICComb
+NameComb
. You can choose which rows you want to delete by choosingorder by
inside theover
clause.In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.
Here is the idea, using SQL Server syntax:
Of course, this depends to a large extent on how fast the first step is. And, you need to have the space to store two copies of the same table.
Note that the syntax for creating the temporary table differs among databases. Some use the syntax of
create table as
rather thanselect into
.EDIT:
Your identity insert error is troublesome. I think you need to remove the identity from the list of columns for the distinct. Or do:
If you have an identity column, then there are no duplicates (by definition).
In the end, you will need to decide which id you want for the rows. If you can generate a new id for the rows, then just leave the identity column out of the column list for the insert:
If you need the old identity value (and the minimum will do), turn off identity insert and do: