Deleting duplicate row that has earliest date

2019-02-27 19:11发布

问题:

I have a table called PF_temo that has the following structure:

  • firstname
  • middlename
  • lastname
  • DOB
  • address
  • city
  • state
  • phone
  • validitydate

It has many rows that are identical, except for the validity date. For example:

steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201609
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201002
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201706
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,199812

I'd like to run a script that deletes all duplicates matched on everything but the last column (validitydate) leaving just the below in the table, which is the most recent validitydate of 201706:

steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201706

This is what I have; though it throws an exception:

DELETE 
FROM PF_temp
LEFT OUTER JOIN 
(
    SELECT Min(ValidityDate) as RowId
    , firstname
    , middlename
    , lastname
    , DOB
    , address
    , city
    , state 
    , phone
    FROM PF_temp 
    GROUP BY firstname
    , middlename
    , lastname
    , DOB
    , address
    , city
    , state 
    , phone
    , validitydate 
) as KeepRows 
ON TableName.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL

It doesn't work and actually pops this error:

Msg 156, Level 15, State 1, Line 2
Incorrect syntax near the keyword 'OUTER'.
Msg 156, Level 15, State 1, Line 4
Incorrect syntax near the keyword 'as'.

Also, I'd like to run it in stages based on the last letter of the alphabet. So something like WHERE lastname like 'A%' needs to be added somewhere.

回答1:

Try this:

delete a
from PF_Temp a
inner join PF_Temp b 
on  b.firstname = a.firstname 
and b.middlename = a.middlename
and b.lastname = a.lastname
and b.DOB = a.DOB
and b.address = a.address
and b.city = a.city
and b.state = a.state
and b.phone = a.phone
and b.validitydate > a.validitydate

Example at SQL Fiddle.

The above works by:

  • joining on all matching fields (except validity date), thus capturing in a all records which have duplicates. At this stage we capture all records, since the record in a would match with itself in b.
  • By specifying that the validitydate in b must be greater than that in a we both avoid the above issue of the record being the same (since if it were the same record, the validity date would be the same), and also ensuring that there's no match if the record in a is the most recent; since there will be no match in b (i.e. no record with a greater validity date).
  • we then delete every record which was returned by a; i.e. every record which has a duplicate with a later validity date.

If you want to only delete those duplicates with a specific last name, you do exactly what you said above; i.e. add the line where a.LastName like 'A%'.


Update

You mention that some columns may contain nulls. Here's a revised version of the above to take into account that null != null.

delete a
from PF_Temp a
inner join PF_Temp b 
on  ((b.firstname = a.firstname) or (b.firstname is null and a.firstname is null))
and ((b.middlename = a.middlename) or (b.middlename is null and a.middlename is null))
and ((b.lastname = a.lastname) or (b.lastname is null and a.lastname is null))
and ((b.DOB = a.DOB) or (b.DOB is null and a.DOB is null))
and ((b.address = a.address) or (b.address is null and a.address is null))
and ((b.city = a.city) or (b.city is null and a.city is null))
and ((b.state = a.state) or (b.state is null and a.state is null))
and ((b.phone = a.phone) or (b.phone is null and a.phone is null))
and b.validitydate > a.validitydate

An alternative to the above would be on coalesce(b.firstname,'') = coalesce(a.firstname) (repeating that pattern for all other matching fields); though that would mean that nulls and blanks were treated the same, and wouldn't perform quite so well.


Alternative Method

A different approach, which is more forgiving of nulls, is to use a subquery to pull back all values, numbering each set with matching values, starting at 1 for the most recent validity date. We then delete all those rows which came back with numbers higher than 1; i.e. any which are duplicates with earlier validity dates.

delete TheDeletables
from 
(
    select *
    , row_number() over (
        partition by 
         firstname 
        , middlename 
        , lastname 
        , DOB 
        , address  
        , city 
        , state 
        , phone 
        order by validitydate desc
    ) rowid
    from PF_Temp
) TheDeletables
where rowid > 1;

Demo SQL Fiddle.