可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am working with a table which is an extract of a set of other tables. All of the rows of the extract table should be unique according to keys D1, D2 and D3. They are not. It appears that an earlier developer attempted to solve this problem by using a SELECT DISTINCT across all columns being queried from this table. This will work, but only if every row which is a duplicate on (D1, D2, D3) is also a duplicate across the non-key columns (ignoring the IDENTITY column that was added to the extract table).

In other words, given rows as follows:

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X3

then

SELECT DISTINCT D1, D2, D3, C4, C5, C6
FROM BAD_TABLE

will "work", as there's no difference between the rows which are duplicated on (D1,D2,D3). But if the table contained

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X4

then SELECT DISTINCT would return two rows for the key (A,B,C). Furthermore, we would have to decide which of X3 or X4 was the "correct" value.

I know how to find the duplicates on (D1,D2,D3). I even know how to find the duplicates across all the columns (other than the IDENTITY column):

;
WITH DUPLICATES(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATES D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3
ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

The question is, how do I find the subset of the above resultset which are duplicates on (D1,D2,D3), but not duplicates on (D1,D2,D3,C4,C5,C6)?

回答1:

Any reason you don't just create another table expression to cover more fields and join to that one?

WITH DUPLICATEKEY(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
WITH NODUPES(D1,D2,D3,C4,C5,C6) AS
(
SELECT 
S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
GROUP BY
 S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
HAVING COUNT(*)=1
)

SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATEKEY D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3

INNER JOIN NODUPES D2
    ON S.D1 = D2.D1 AND S.D2 = D2.D2 AND S.D3 = D2.D3

ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

回答2:

You can do it by joining the table on itself, saying the D's are all equal and at least one of the C's are not equal.

CREATE TABLE #Source (
    D1 VARCHAR(2),
    D2 VARCHAR(2),
    D3 VARCHAR(2),
    C4 VARCHAR(2),
    C5 VARCHAR(2),
    C6 VARCHAR(2) );

INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X4');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');

SELECT S1.D1, S1.D2, S1.D3, S1.C4 C4_1, S2.C4 C4_2, S1.C5 C5_1, S2.C5 C5_2, S1.C6 C6_1, S2.C6 C6_2
FROM
    #Source S1
            INNER JOIN
    #Source S2
            ON
        (       S1.D1 = S2.D1 
            AND S1.D2 = S2.D2
            AND S1.D3 = S2.D3
            AND (   S1.C4 <> S2.C4
                 OR S1.C5 <> S2.C5
                 OR S1.C6 <> S2.C6
                 )
        );

DROP TABLE #Source;

Gives the following results:

D1   D2   D3   C4_1 C4_2 C5_1 C5_2 C6_1 C6_2
---- ---- ---- ---- ---- ---- ---- ---- ----
A    B    C    X1   X1   X2   X2   X4   X3
A    B    C    X1   X1   X2   X2   X3   X4

Also note that this is compatible with MS SQL 2000 as you later indicated is required in How to Convert a SQL Query using Common Table Expressions to One Without (for SQL Server 2000).

回答3:

I haven't had a chance to try Conrad's answer yet, but came up with one of my own. It's rather a "duh" moment.

So, if you want to find all the rows in set A except for those that are in set B, you use the EXCEPT operator:

; 
WITH KEYDUPLICATES(D1,D2,D3) AS 
( 
    SELECT D1, D2, D3 
    FROM SOURCE 
    GROUP BY D1, D2, D3 
    HAVING COUNT(*)>1 
),
KEYDUPLICATEROWS AS
( 
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S 
    INNER JOIN KEYDUPLICATES D 
        ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3 
),
FULLDUPLICATES AS
(
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S
    GROUP BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    HAVING COUNT(*)>1
)
SELECT KR.D1, KR.D2, KR.D3, KR.C4, KR.C5, KR.C6
FROM KEYDUPLICATEROWS AS KR
EXCEPT
SELECT FD.D1, FD.D2, FD.D3, FD.C4, FD.C5, FD.C6
FROM FULLDUPLICATES AS FD
ORDER BY D1, D2, D3, C4, C5, C6

This seems to be showing me 1500 rows which are duplicates across (D1,D2,D3), but which are only duplicates across a subset of (D1,D2,D3,C4,C5,C6). In fact, it appears they are duplicates across (D1,D2,D3,C4,C5).

How to confirm that will be the subject of another question.

回答4:

This would have performance limitations, but is much easier to understand:

SELECT D1, D2, D3
FROM TEST_KEY TK
WHERE (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3
         HAVING COUNT(*) > 1)
  AND (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3, C4, C5, C6
         HAVING COUNT(*) < 2)

Unable to test on SQL-Server, hope the syntax is good.

Again, not sure if you have analytic functions in SQL-Server, but this one works in Oracle and might be faster:

WITH BAD_DUP AS (
SELECT TK.*,
       COUNT(1) OVER (PARTITION BY D1, D2, D3, C4, C5, C6 ORDER BY D1) FULL_DUP,
       COUNT(1) OVER (PARTITION BY D1, D2, D3 ORDER BY D1) KEY_DUP
FROM TEST_KEY TK)
SELECT * FROM BAD_DUP
WHERE FULL_DUP < KEY_DUP

Would like to get it down to a single query....

回答5:

I know this is an old question, but I saw activity on the question and the technique I always use for these is not presented here as an answer, and it's really quite simple, so I figured I'd present it.

SELECT D1, D2, D3, MIN(C4), MAX(C4), MIN(C5), MAX(C5), MIN(C6), MAX(C6)
FROM BAD_TABLE
GROUP BY D1, D2, D3
HAVING MIN(C4) <> MAX(C4)
    OR MIN(C5) <> MAX(C5)
    OR MIN(C6) <> MAX(C6)

This will show all the keys of duplicates on key but with differences on non-keys, with the range of differences duplicates.

To see all the rows within that, you would need to join back to BAD_TABLE as your example in the original question.