Best self join technique when checking for duplicates

Published 2020-06-04 01:57

Question:

I'm trying to optimize a query in production that is taking a long time. The goal is to find duplicate records based on matching field values and then delete them. The current query uses a self join (an inner join on t1.col1 = t2.col1) followed by a WHERE clause to compare the remaining values.

select * from table t1 
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...

What would be a better way to do this? Or is it all the same based on indexes? Maybe

select * from table t1, table t2
where t1.col1 = t2.col1 and t1.col2 = t2.col2 ...

This table has 100m+ rows.

MS SQL, SQL Server 2008 Enterprise

select distinct t2.id
    from table1 t1 with (nolock)
    inner join table1 t2 with (nolock) on  t1.ckid=t2.ckid
    left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where 
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and 
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1)and 
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)  

    and (
            (
                t3.uniqueoid = 1
            )
            or
            (
                t3.uniqueoid is null and 
                isnull(t1.col1,'') = isnull(t2.col1,'') and 
                isnull(t1.col2,'') = isnull(t2.col2,'') and
                isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and 
                isnull(t1.stid,-1) = isnull(t2.stid,-1) and
                isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
                isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
                isnull(t1.col3,-1) = isnull(t2.col3,-1) 
            )
    )

Answer 1:

Why a self join? This is an aggregation problem.

Hopefully you have an index on col1, col2, ...

--DELETE table
--WHERE KeyCol NOT IN (
select
    MIN(KeyCol) AS RowToKeep,
    col1, col2
from
    table
GROUP BY
    col1, col2
HAVING
   COUNT(*) > 1
--)

However, this will take some time on a table that size; have a look at bulk delete techniques.
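A hedged sketch of one such bulk technique, deleting in batches so the transaction log stays small and locks are held briefly (table and column names follow the examples above and are assumptions):

```sql
-- Delete duplicates in 10,000-row batches, keeping the row with the
-- lowest KeyCol in each (col1, col2) group. Loop until nothing is left.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) t
    FROM table1 t
    JOIN (
        SELECT col1, col2, MIN(KeyCol) AS RowToKeep
        FROM table1
        GROUP BY col1, col2
        HAVING COUNT(*) > 1
    ) d ON t.col1 = d.col1
       AND t.col2 = d.col2
       AND t.KeyCol <> d.RowToKeep;   -- never delete the keeper
    SET @rows = @@ROWCOUNT;
END
```

The join form also sidesteps the multi-column NOT IN from the commented-out sketch above, which SQL Server does not support directly.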



Answer 2:

You can use ROW_NUMBER() to find duplicate rows in one table.

You can check here
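A minimal sketch of the ROW_NUMBER() approach (column names col1/col2 and key column id are assumptions): number each row within its duplicate group, then delete every row numbered above 1.

```sql
-- SQL Server allows deleting through a CTE; rows with rn > 1 are the
-- extra copies, since ORDER BY id makes the lowest id the keeper.
;WITH numbered AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY col1, col2  -- columns that define a duplicate
               ORDER BY id              -- which copy to keep
           ) AS rn
    FROM table1
)
DELETE FROM numbered
WHERE rn > 1;
```

This touches the table only once, with no self join.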



Answer 3:

The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.

And, by the way, this won't work as written. You need at least one predicate that differs between the two aliases (e.g. t1.id < t2.id), or every record will match itself.

You might want to try something more like:

select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1


Answer 4:

For a table with 100m+ rows, using GROUP BY and a holding table is a well-optimized approach, even though it translates into four queries.

STEP 1: create a holding key:

SELECT col1, col2, dupcount = COUNT(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1

STEP 2: Push one copy of each duplicate entry into a holddups table. This is required for Step 4.

SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

STEP 3: Delete the duplicate rows from the original table.

DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

STEP 4: Put the unique rows back in the original table. For example:

INSERT t1 SELECT * FROM holddups


Answer 5:

To detect duplicates, you don't need to join:

SELECT col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1

That should be much faster.



Answer 6:

In my experience, SQL Server performance is really bad with OR conditions. The problem is probably not the self join but the join with table2 (aliased t3), though without seeing the execution plan I can't be sure.

In that case, it might help to split your query in two: one with the WHERE condition t3.uniqueoid = 1 and one with the WHERE conditions for the t3.uniqueoid IS NULL branch, then append one to the other with UNION ALL.
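A sketch of that split, using the table and column names from the question (the long runs of ISNULL equality conditions are abbreviated):

```sql
-- Branch 1: rows matched via uniqueoid = 1. The LEFT JOIN can become an
-- inner join here, since the filter requires a table2 match anyway.
SELECT DISTINCT t2.id
FROM table1 t1
JOIN table1 t2 ON t1.ckid = t2.ckid
JOIN table2 t3 ON t1.cid = t3.cid AND t1.typeid = t3.typeid
WHERE t3.uniqueoid = 1
  AND t2.id > @Max_id
  -- ...remaining equality conditions from the original query...

UNION ALL

-- Branch 2: rows with no uniqueoid, compared on the wider column set.
SELECT DISTINCT t2.id
FROM table1 t1
JOIN table1 t2 ON t1.ckid = t2.ckid
LEFT JOIN table2 t3 ON t1.cid = t3.cid AND t1.typeid = t3.typeid
WHERE t3.uniqueoid IS NULL
  AND t2.id > @Max_id
  -- ...remaining equality conditions from the original query...
```

Each branch gets its own execution plan, so the optimizer can pick a better access path than it can for the combined OR predicate.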