SQL Remove almost duplicate rows

Posted 2020-05-28 22:57

Question:

I have a table that unfortunately contains bad data, and I'm trying to filter some of it out. I am sure that the LName, FName combination is unique, since the data set is small enough to verify.

LName, FName, Email
-----  -----  -----
Smith  Bob    bsmith@example.com
Smith  Bob    NULL
Doe    Jane   NULL
White  Don    dwhite@example.com

I would like to have the query results bring back the "duplicate" record that does not have a NULL email, yet still bring back a NULL Email when there is not a duplicate.

E.g.

Smith Bob   bsmith@example.com
Doe   Jane  NULL
White Don   dwhite@example.com

I think the solution is similar to the question "Sql, remove duplicate rows by value", but I'm not sure whether that asker's requirements are the same as mine.

Any suggestions?

Thanks

Answer 1:

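MIN ignores NULLs, so grouping by name and taking MIN(email) returns the non-NULL email when the group has one, and NULL only when every row in the group has a NULL email.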

SELECT  lname
        , fname
        , MIN(email)
FROM    YourTable
GROUP BY
        lname
        , fname

Test script

DECLARE @Test TABLE (
  LName VARCHAR(32)
  , FName VARCHAR(32)
  , Email VARCHAR(32)
)

INSERT INTO @Test
  SELECT 'Smith', 'Bob', 'bsmith@example.com'
  UNION ALL SELECT 'Smith', 'Bob', NULL   -- unquoted: 'NULL' in quotes is a string, not a NULL
  UNION ALL SELECT 'Doe', 'Jane', NULL
  UNION ALL SELECT 'White', 'Don', 'dwhite@example.com'

SELECT  lname
        , fname
        , MIN(Email)        
FROM    @Test
GROUP BY
        lname
        , fname


Answer 2:

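You can use the ROW_NUMBER() analytic function. Ordering by Email DESC ranks the non-NULL email first on SQL Server, where NULLs sort lowest; on databases that sort NULLs highest (Oracle, PostgreSQL), append NULLS LAST to the ORDER BY: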

SELECT *
FROM (
        SELECT a.*,
               ROW_NUMBER() OVER (PARTITION BY LName, FName ORDER BY Email DESC) AS rnk
        FROM <YOUR_TABLE> a
     ) a
WHERE rnk = 1
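
A sketch of a variant that does not rely on how a given database sorts NULLs: a CASE expression ranks non-NULL emails ahead of NULL ones explicitly. It is run here through Python's sqlite3 module only so the result can be checked (window functions need SQLite 3.25 or later). The table name Person and the use of SQLite are assumptions for the demo; the rows are the ones from the question.

```python
import sqlite3

# In-memory SQLite database used as a stand-in (assumption: the original
# question targets another engine; only the NULL-ordering trick matters here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (LName TEXT, FName TEXT, Email TEXT);
    INSERT INTO Person VALUES
        ('Smith', 'Bob',  'bsmith@example.com'),
        ('Smith', 'Bob',  NULL),
        ('Doe',   'Jane', NULL),
        ('White', 'Don',  'dwhite@example.com');
""")

# Rank rows within each (LName, FName) group. The CASE expression sorts
# non-NULL emails (0) before NULL emails (1) regardless of the engine's
# default NULL ordering; Email is a tie-breaker.
query = """
    SELECT LName, FName, Email
    FROM (
        SELECT p.*,
               ROW_NUMBER() OVER (
                   PARTITION BY LName, FName
                   ORDER BY CASE WHEN Email IS NULL THEN 1 ELSE 0 END, Email
               ) AS rn
        FROM Person AS p
    ) AS ranked
    WHERE rn = 1
    ORDER BY LName, FName
"""
result = conn.execute(query).fetchall()

for row in result:
    print(row)
```

Each name appears once in the result, and Jane Doe keeps her NULL email because her group has no other row to prefer.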


Answer 3:

Here is a relatively simple query that uses standard SQL and does just this:

SELECT * FROM Person P
WHERE Email IS NOT NULL OR -- Take all people with non-null e-mails
      Email IS NULL AND    -- and all people with null e-mails, as long as
        NOT EXISTS         -- there is no duplicate record of the same person
          (SELECT *        -- with a non-null e-mail
           FROM Person P2 
           WHERE P2.LName=P.LName AND P2.FName=P.FName AND P2.Email IS NOT NULL)
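
To check this query against the sample rows from the question, here is a minimal sketch that runs it through Python's sqlite3 module. SQLite is only a stand-in database here (an assumption). The parentheses around the AND branch are added for readability; AND already binds tighter than OR, so the logic is unchanged.

```python
import sqlite3

# In-memory SQLite database holding the question's sample rows
# (SQLite is an assumption made so the query can be executed here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (LName TEXT, FName TEXT, Email TEXT);
    INSERT INTO Person VALUES
        ('Smith', 'Bob',  'bsmith@example.com'),
        ('Smith', 'Bob',  NULL),
        ('Doe',   'Jane', NULL),
        ('White', 'Don',  'dwhite@example.com');
""")

# Keep every row with an email, plus NULL-email rows for people
# who have no row with a non-NULL email.
kept = conn.execute("""
    SELECT LName, FName, Email
    FROM Person AS P
    WHERE P.Email IS NOT NULL
       OR (P.Email IS NULL
           AND NOT EXISTS (SELECT 1
                           FROM Person AS P2
                           WHERE P2.LName = P.LName
                             AND P2.FName = P.FName
                             AND P2.Email IS NOT NULL))
    ORDER BY LName, FName
""").fetchall()

for row in kept:
    print(row)
```

Note that this is a filter, not a grouping: if a person had two rows that both carried an email, or two rows that were both NULL, both rows would come back. With names unique apart from the NULL duplicates, as the question states, that case does not arise.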


Answer 4:

Since there are plenty of SQL solutions posted already, you may want to create a data fix to remove the bad data, then add the necessary constraints to prevent bad data from ever being inserted. Bad data in a database is a side effect of poor design.
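
One possible shape for such a data fix, sketched with Python's sqlite3 module so it can be run end to end. The table name Person, the index name ux_person_name, and the choice of a unique index on (LName, FName) are all assumptions; the answer does not say which constraints it has in mind, and the DELETE and index syntax may need adjusting for other databases.

```python
import sqlite3

# In-memory SQLite database with the question's sample rows (assumed setup).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (LName TEXT, FName TEXT, Email TEXT);
    INSERT INTO Person VALUES
        ('Smith', 'Bob',  'bsmith@example.com'),
        ('Smith', 'Bob',  NULL),
        ('Doe',   'Jane', NULL),
        ('White', 'Don',  'dwhite@example.com');
""")

# Data fix: delete a NULL-email row only when the same person
# also has a row with a non-NULL email.
conn.execute("""
    DELETE FROM Person
    WHERE Email IS NULL
      AND EXISTS (SELECT 1
                  FROM Person AS P2
                  WHERE P2.LName = Person.LName
                    AND P2.FName = Person.FName
                    AND P2.Email IS NOT NULL)
""")

# Constraint (hypothetical name): at most one row per (LName, FName) from now on.
conn.execute("CREATE UNIQUE INDEX ux_person_name ON Person (LName, FName)")

rows = conn.execute(
    "SELECT LName, FName, Email FROM Person ORDER BY LName, FName"
).fetchall()
print(rows)

# A second row for an existing name is now rejected by the unique index.
try:
    conn.execute("INSERT INTO Person VALUES ('Smith', 'Bob', NULL)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Running the delete before creating the index matters: the index cannot be built while the duplicate Smith, Bob rows are still in the table.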