How to delete duplicate rows in sql server?

2018-12-31 09:48发布

How can I delete duplicate rows where no unique row id exists?

My table is

col1  col2 col3 col4 col5 col6 col7
john  1    1    1    1    1    1 
john  1    1    1    1    1    1
sally 2    2    2    2    2    2
sally 2    2    2    2    2    2

I want to be left with the following after the duplicate removal:

john  1    1    1    1    1    1
sally 2    2    2    2    2    2

I've tried a few queries but i think they depend on a row id as I don't get desired result. For example:

DELETE FROM table WHERE col1 IN (
    SELECT id FROM table GROUP BY id HAVING ( COUNT(col1) > 1 )
)

15条回答
伤终究还是伤i
2楼-- · 2018-12-31 10:28
-- this query will keep only one instance of a duplicate record.
;WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3-- based on what? --can be multiple columns
                                       ORDER BY ( SELECT 0)) RN
         FROM   Mytable)



delete  FROM cte
WHERE  RN > 1
查看更多
几人难应
3楼-- · 2018-12-31 10:29

After trying the suggested solution above, that works for small medium tables. I can suggest that solution for very large tables. since it runs in iterations.

  1. Drop all dependency views of the LargeSourceTable
  2. you can find the dependecies by using sql managment studio, right click on the table and click "View Dependencies"
  3. Rename the table:
  4. sp_rename 'LargeSourceTable', 'LargeSourceTable_Temp'; GO
  5. Create the LargeSourceTable again, but now, add a primary key with all the columns that define the duplications add WITH (IGNORE_DUP_KEY = ON)
  6. For example:

    CREATE TABLE [dbo].[LargeSourceTable] ( ID int IDENTITY(1,1), [CreateDate] DATETIME CONSTRAINT [DF_LargeSourceTable_CreateDate] DEFAULT (getdate()) NOT NULL, [Column1] CHAR (36) NOT NULL, [Column2] NVARCHAR (100) NOT NULL, [Column3] CHAR (36) NOT NULL, PRIMARY KEY (Column1, Column2) WITH (IGNORE_DUP_KEY = ON) ); GO

  7. Create again the views that you dropped in the first place for the new created table

  8. Now, Run the following sql script, you will see the results in 1,000,000 rows per page, you can change the row number per page to see the results more often.

  9. Note, that I set the IDENTITY_INSERT on and off because one the columns contains auto incremental id, which I'm also copying

SET IDENTITY_INSERT LargeSourceTable ON DECLARE @PageNumber AS INT, @RowspPage AS INT DECLARE @TotalRows AS INT declare @dt varchar(19) SET @PageNumber = 0 SET @RowspPage = 1000000 select @TotalRows = count (*) from LargeSourceTable_TEMP

While ((@PageNumber - 1) * @RowspPage < @TotalRows )
Begin
    begin transaction tran_inner
        ; with cte as
        (
            SELECT * FROM LargeSourceTable_TEMP ORDER BY ID
            OFFSET ((@PageNumber) * @RowspPage) ROWS
            FETCH NEXT @RowspPage ROWS ONLY
        )

        INSERT INTO LargeSourceTable 
        (
             ID                     
            ,[CreateDate]       
            ,[Column1]   
            ,[Column2] 
            ,[Column3]       
        )       
        select 
             ID                     
            ,[CreateDate]       
            ,[Column1]   
            ,[Column2] 
            ,[Column3]       
        from cte

    commit transaction tran_inner

    PRINT 'Page: ' + convert(varchar(10), @PageNumber)
    PRINT 'Transfered: ' + convert(varchar(20), @PageNumber * @RowspPage)
    PRINT 'Of: ' + convert(varchar(20), @TotalRows)

    SELECT @dt = convert(varchar(19), getdate(), 121)
    RAISERROR('Inserted on: %s', 0, 1, @dt) WITH NOWAIT
    SET @PageNumber = @PageNumber + 1
End

SET IDENTITY_INSERT LargeSourceTable OFF

查看更多
临风纵饮
4楼-- · 2018-12-31 10:31

I would prefer CTE for deleting duplicate rows from sql server table

strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

by keeping original

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

without keeping original

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
查看更多
不流泪的眼
5楼-- · 2018-12-31 10:34

Microsoft has a vey ry neat guide on how to remove duplicates. Check out http://support.microsoft.com/kb/139444

In brief, here is the easiest way to delete duplicates when you have just a few rows to delete:

SET rowcount 1;
DELETE FROM t1 WHERE myprimarykey=1;

myprimarykey is the identifier for the row.

I set rowcount to 1 because I only had two rows that were duplicated. If I had had 3 rows duplicated then I would have set rowcount to 2 so that it deletes the first two that it sees and only leaves one in table t1.

Hope it helps anyone

查看更多
伤终究还是伤i
6楼-- · 2018-12-31 10:36

With reference to https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server

The idea of removing duplicate involves

  • a) Protecting those rows that are not duplicate
  • b) Retain one of the many rows that qualified together as duplicate.

Step-by-step

  • 1) First identify the rows those satisfy the definition of duplicate and insert them into temp table, say #tableAll .
  • 2) Select non-duplicate(single-rows) or distinct rows into temp table say #tableUnique.
  • 3) Delete from source table joining #tableAll to delete the duplicates.
  • 4) Insert into source table all the rows from #tableUnique.
  • 5) Drop #tableAll and #tableUnique
查看更多
几人难应
7楼-- · 2018-12-31 10:40

Without using CTE and ROW_NUMBER() you can just delete the records just by using group by with MAX function here is and example

DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyDuplicateTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)
查看更多
登录 后发表回答