What's the best way to dedupe a table?

Posted 2019-01-21 19:33

I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.

(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)

14 answers
甜甜的少女心
#2 · 2019-01-21 19:51

There are several ways to deduplicate (dedupe) a table, that is, to remove duplicate or repeated rows.

  1. If the duplicated rows are exactly the same, use GROUP BY:

    create table TABLE_NAME_DEDUP as
    select column1, column2, ... -- all column names
    from TABLE_NAME
    group by column1, column2, ... -- all column names

Then TABLE_NAME_DEDUP is the deduplicated table.

For example,

    create table test (t1 varchar(5), t2 varchar(5));
    insert into test values ('12345', 'ssdlh');
    insert into test values ('12345', 'ssdlh');
    create table test_dedup as
    select * from test
    group by t1, t2;
    -- optional:
    -- remove the original table and rename the dedup table to the original name
    -- (this is not recommended in dev or qa)
    -- DROP table test;
    -- Alter table test_dedup rename to test;

  2. You have a row ID column, and several rows share the same row ID but differ in other columns (the records are only partially identical). This can happen in a transactional system while a row is being updated: the rows that failed to update contain nulls. You want to remove the duplicates:

    create table test_dedup as
    select column1, column2, ... -- all column names
    from (
      select t.*,
             row_number() over (
               partition by rowid
               order by column1, column2, ... -- all column names except rowid
             ) as cn
      from test t
    ) ranked
    where cn = 1

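This relies on ORDER BY placing null values after non-null values, so the most complete row gets row number 1. That is the default for ascending order in Oracle and PostgreSQL; MySQL, SQL Server, and SQLite sort nulls first by default, so there the ordering has to be made explicit.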

    create table test (rowid_ varchar(5), t1 varchar(5), t2 varchar(5));
    insert into test values ('12345', 'ssdlh', null);
    insert into test values ('12345', 'ssdlh', 'lhbzj');
    create table test_dedup as
    select rowid_, t1, t2 from
    (select t.*
      , row_number() over (partition by rowid_ order by t1, t2) as cn
      from test t) ranked
    where cn = 1;

    -- optional:
    -- remove the original table and rename the dedup table to the original name
    -- (this is not recommended in dev or qa)
    -- DROP table test;
    -- Alter table test_dedup rename to test;
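
Because the default position of nulls differs between databases, a more portable variant sorts on an explicit "is null" flag first. The following is only a minimal sketch, written in Python with the standard-library sqlite3 module (window functions need SQLite 3.25 or later). The table and column names follow the example above; the CASE-based ordering is my own assumption about how to make the intent explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (rowid_ VARCHAR(5), t1 VARCHAR(5), t2 VARCHAR(5));
    INSERT INTO test VALUES ('12345', 'ssdlh', NULL);
    INSERT INTO test VALUES ('12345', 'ssdlh', 'lhbzj');
""")

# Sort on a null flag (0 = has a value, 1 = null) before the column itself,
# so the non-null row gets row number 1 even though SQLite sorts nulls first.
conn.execute("""
    CREATE TABLE test_dedup AS
    SELECT rowid_, t1, t2
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY rowid_
                   ORDER BY CASE WHEN t1 IS NULL THEN 1 ELSE 0 END, t1,
                            CASE WHEN t2 IS NULL THEN 1 ELSE 0 END, t2
               ) AS cn
        FROM test t
    ) ranked
    WHERE cn = 1
""")

deduped = conn.execute("SELECT rowid_, t1, t2 FROM test_dedup").fetchall()
print(deduped)
```

Only the row with the non-null t2 ends up in test_dedup. Without the CASE flags, SQLite would have kept the row containing the null instead.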
爷的心禁止访问
#3 · 2019-01-21 19:57

I am taking the one from DShook and providing a dedupe example where you would keep only the record with the highest date.

In this example say I have 3 records all with the same app_id, and I only want to keep the one with the highest date:

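    -- Delete every row that is older than the newest row for the same app_id.
    -- Rows that tie on the newest processed_date are all kept.
    DELETE t
    FROM @USER_OUTBOX_APPS t
    INNER JOIN
    (
        SELECT
             app_id
            ,max(processed_date) as max_processed_date
        FROM @USER_OUTBOX_APPS
        GROUP BY app_id
        HAVING count(*) > 1 -- only app_ids that actually have duplicates
    ) t2 on
        t.app_id = t2.app_id
    WHERE
        t.processed_date < t2.max_processed_date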
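
The DELETE ... FROM ... JOIN form is T-SQL. As a rough sketch of the same idea in a more portable shape, a correlated subquery can express "delete every row that is older than the newest one for its app_id". The example uses Python's standard-library sqlite3 module; the table name user_outbox_apps and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_outbox_apps (app_id INTEGER, processed_date TEXT);
    INSERT INTO user_outbox_apps VALUES (1, '2019-01-01');
    INSERT INTO user_outbox_apps VALUES (1, '2019-01-15');
    INSERT INTO user_outbox_apps VALUES (1, '2019-01-21');
    INSERT INTO user_outbox_apps VALUES (2, '2019-01-05');
""")

# Delete each row for which a newer row with the same app_id exists.
# Rows that tie on the newest date are all kept, as in the T-SQL version.
conn.execute("""
    DELETE FROM user_outbox_apps
    WHERE processed_date < (
        SELECT MAX(o.processed_date)
        FROM user_outbox_apps o
        WHERE o.app_id = user_outbox_apps.app_id
    )
""")

remaining = conn.execute(
    "SELECT app_id, processed_date FROM user_outbox_apps ORDER BY app_id"
).fetchall()
print(remaining)
```

The dates are stored as ISO-8601 text here, so a plain string comparison orders them correctly.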
Juvenile、少年°
#4 · 2019-01-21 19:57

For those of you who prefer a quick and dirty approach, just list all the columns that together define a unique record and create a unique index with those columns, like so:

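    ALTER IGNORE TABLE TABLE_NAME ADD UNIQUE (column1,column2,column3)

You can drop the unique index afterwards. Note that this is MySQL-specific, and the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so it only works on older versions.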

来,给爷笑一个
#5 · 2019-01-21 19:59

Here's one I've run into, in real life.

Assume you have a table of external/3rd party logins for users, and you're going to merge two users and want to dedupe on the provider/provider key values.

    ;WITH Logins AS
    (
        SELECT [LoginId],[UserId],[Provider],[ProviderKey]
        FROM [dbo].[UserLogin] 
        WHERE [UserId]=@FromUserID -- is the user we're deleting
              OR [UserId]=@ToUserID -- is the user we're moving data to
    ), Ranked AS 
    (
        SELECT Logins.*
            , [Picker]=ROW_NUMBER() OVER (
                       PARTITION BY [Provider],[ProviderKey]
                       ORDER BY CASE WHEN [UserId]=@FromUserID THEN 1 ELSE 0 END)
        FROM Logins
    )
    MERGE Logins AS T
    USING Ranked AS S
    ON S.[LoginId]=T.[LoginID]
    WHEN MATCHED AND S.[Picker]>1 -- duplicate Provider/ProviderKey
                 AND T.[UserID]=@FromUserID -- safety check 
    THEN DELETE
    WHEN MATCHED AND S.[Picker]=1 -- the only or best one
                 AND T.[UserID]=@FromUserID
    THEN UPDATE SET T.[UserID]=@ToUserID
    OUTPUT $action, DELETED.*, INSERTED.*;
Melony?
#6 · 2019-01-21 20:02

This removes rows with duplicated values in c1, keeping the row with the smallest c2 for each c1:

    select * from foo
    minus
    select f1.* from foo f1, foo f2
    where f1.c1 = f2.c1 and f1.c2 > f2.c2
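
MINUS is Oracle syntax; most other databases spell the same set operator EXCEPT. A small sketch with Python's standard-library sqlite3 module shows the effect. The sample data is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (c1 TEXT, c2 INTEGER);
    INSERT INTO foo VALUES ('a', 1);
    INSERT INTO foo VALUES ('a', 2);
    INSERT INTO foo VALUES ('a', 3);
    INSERT INTO foo VALUES ('b', 7);
""")

# All rows, except every row that has a sibling with the same c1 and a
# smaller c2. What is left is the row with the smallest c2 per c1.
rows = conn.execute("""
    SELECT * FROM foo
    EXCEPT
    SELECT f1.* FROM foo f1, foo f2
    WHERE f1.c1 = f2.c1 AND f1.c2 > f2.c2
    ORDER BY c1
""").fetchall()
print(rows)
```

This only selects the surviving rows. To persist the result you would still write them into a new table, and because EXCEPT is a set operator, rows that are identical in every column collapse into one as well.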
#7 · 2019-01-21 20:07

In SQL (MySQL syntax), you can use INSERT IGNORE INTO table SELECT xy FROM unkeyed_table; rows that would violate a unique key on the target table are skipped.
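
SQLite has a close analogue, INSERT OR IGNORE, which makes the idea easy to try out. The sketch below uses Python's standard-library sqlite3 module: rows are copied into a table whose unique constraint covers the columns that define a duplicate, and the database skips the repeats. The name keyed_table, the columns x and y, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unkeyed_table (x TEXT, y TEXT);
    INSERT INTO unkeyed_table VALUES ('12345', 'ssdlh');
    INSERT INTO unkeyed_table VALUES ('12345', 'ssdlh');
    INSERT INTO unkeyed_table VALUES ('67890', 'lhbzj');

    -- The UNIQUE constraint lists the columns that define a duplicate.
    CREATE TABLE keyed_table (x TEXT, y TEXT, UNIQUE (x, y));

    -- Rows that would violate the constraint are silently skipped.
    INSERT OR IGNORE INTO keyed_table SELECT x, y FROM unkeyed_table;
""")

rows = conn.execute("SELECT x, y FROM keyed_table ORDER BY x").fetchall()
print(rows)
```

One caveat: unique constraints in SQLite and MySQL treat nulls as distinct from each other, so rows that contain a null in one of the key columns are not deduplicated this way.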

As an algorithm: if the to-be-primary key may be repeated, but you can assume that it uniquely identifies the content of the row, then hash only the to-be-primary key and check for repetition.
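
The hashing idea can be sketched in a few lines of Python. The row layout (a dict with an "id" field acting as the to-be-primary key) and the function name dedupe_by_key are assumptions for illustration; a Python set does the hashing.

```python
from typing import Dict, Iterable, Iterator


def dedupe_by_key(
    rows: Iterable[Dict[str, object]], key_field: str
) -> Iterator[Dict[str, object]]:
    """Yield the first row seen for each key value, skipping later repeats."""
    seen = set()  # to-be-primary keys encountered so far (hashed by the set)
    for row in rows:
        key = row[key_field]
        if key in seen:
            continue  # same key is assumed to mean same content: a duplicate
        seen.add(key)
        yield row


rows = [
    {"id": "12345", "value": "ssdlh"},
    {"id": "12345", "value": "ssdlh"},
    {"id": "67890", "value": "lhbzj"},
]
unique_rows = list(dedupe_by_key(rows, "id"))
print(unique_rows)
```

This is a single pass over the data with memory proportional to the number of distinct keys. It is only correct under the stated assumption that equal keys imply equal rows.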
