Efficient query for finding duplicate records

2019-06-08 11:29发布

问题:

I need to query a table for duplicate deposit records, where two deposits made at one cash terminal, for the same amount, within a certain time window, are considered duplicate records. I've started working on a query now, but I would appreciate any advice or suggestions on doing this 'properly'.

回答1:

Generally, you'd do a self join to the same table, and put your "duplicate" criteria in the join conditions.

E.g.

SELECT
    *
FROM
    Transactions t1
        inner join
    Transactions t2
        on
            t1.Terminal = t2.Terminal and
            t1.Amount = t2.Amount and
            DATEDIFF(minute,t2.TransactionDate,t1.TransactionDate) between 0 and 10 and
            t1.TransactionID > t2.TransactionID /* prevent matching the same row */


回答2:

Simple aggregate

SELECT
   col1, col2, col3, ...
FROM
   MyTable
GROUP BY
   col1, col2, col3, ...
HAVING
   COUNT(*) >= 2

Don't include your identity/key/PK column: this will be unique per row and mess up the aggregate.

To get a row to remove or keep, do a MAX or MIN on that

SELECT
   col1, col2, col3, ...,
   MAX(IDCol) AS RowToDelete,
   MIN(IDCol) AS RowToKeep
FROM
   MyTable
GROUP BY
   col1, col2, col3, ...
HAVING
   COUNT(*) >= 2

Of course, with 3 duplicates then do a "keep".

Edit:

For rows within a time window, use a self join or window/ranking function