Getting duplicates with additional information

Posted 2019-03-04 07:51

I've inherited a database and I'm having trouble constructing a working SQL query.

Suppose this is the data:

[Products]

| Id    | DisplayId     | Version   | Company   | Description   |
|----   |-----------    |---------- |-----------| -----------   |
| 1     | 12345         | 0         | 16        | Random        |
| 2     | 12345         | 0         | 2         | Random 2      |
| 3     | AB123         | 0         | 1         | Random 3      |
| 4     | 12345         | 1         | 16        | Random 4      |
| 5     | 12345         | 1         | 2         | Random 5      |
| 6     | AB123         | 0         | 5         | Random 6      |
| 7     | 12345         | 2         | 16        | Random 7      |
| 8     | XX45          | 0         | 5         | Random 8      |
| 9     | XX45          | 0         | 7         | Random 9      |
| 10    | XX45          | 1         | 5         | Random 10     |
| 11    | XX45          | 1         | 7         | Random 11     |


[Companies]

| Id    | Code      |
|----   |-----------|
| 1     | 'ABC'     |
| 2     | '456'     |
| 5     | 'XYZ'     |
| 7     | 'XYZ'     |
| 16    | '456'     |

The Version column is a version number; higher numbers indicate more recent versions. The Company column is a foreign key referencing the Id column of the Companies table. There's another table called ProductData with a ProductId column referencing Products.Id.

Now I need to find duplicates based on the DisplayId and the corresponding Companies.Code. The ProductData table should be joined to show a title (ProductData.Title), and only the most recent version of each product per company should be included in the results. So the expected results are:

| Id    | DisplayId     | Version   | Company   | Description   | ProductData.Title |
|----   |-----------    |---------- |-----------|-------------  |------------------ |
| 5     | 12345         | 1         | 2         | Random 5      | Title 5           |
| 7     | 12345         | 2         | 16        | Random 7      | Title 7           |
| 10    | XX45          | 1         | 5         | Random 10     | Title 10          |
| 11    | XX45          | 1         | 7         | Random 11     | Title 11          |

  • XX45 is included because it has 2 "entries": one with Company 5 and one with Company 7, and both companies share the same code.
  • 12345 is included because it has 2 "entries": one with Company 2 and one with Company 16, and both companies share the same code. Note that the most recent version differs between the two (version 2 for company 16's entry and version 1 for company 2's entry).
  • AB123 should not be included, as its 2 entries have different company codes.

I'm eager to learn your insights...

4 answers
何必那么认真
#2 · 2019-03-04 08:12

If I understood you correctly, you can use a CTE to number the rows for each DisplayId and Company, then SELECT from the CTE and add further filtering as needed.

WITH CTE AS (
   SELECT p.Id, p.DisplayId, p.Version, p.Company, p.Description, d.Title,
       RN = ROW_NUMBER() OVER (PARTITION BY p.DisplayId, p.Company ORDER BY p.Id DESC)
   FROM dbo.Products AS p
   INNER JOIN dbo.ProductData AS d ON d.ProductId = p.Id
)

SELECT *
FROM CTE
Fickle 薄情
#3 · 2019-03-04 08:16

Based on your sample data, you just need to JOIN the tables:

  SELECT 
    p.Id, p.DisplayId, p.Version, p.Company, d.Title
  FROM Products AS p
  INNER JOIN Companies AS c ON p.Company = c.Id
  INNER JOIN ProductData AS d ON d.ProductId = p.Id;

But if you want only the latest row for each DisplayId and Company, you can use ROW_NUMBER():

WITH CTE
AS
(
  SELECT 
    p.Id, p.DisplayId, p.Version, p.Company, d.Title,
    ROW_NUMBER() OVER(PARTITION BY p.DisplayId,p.Company ORDER BY p.Id DESC) AS RN
  FROM Products AS p
  INNER JOIN Companies AS c ON p.Company = c.Id
  INNER JOIN ProductData AS d ON d.ProductId = p.Id
)
SELECT * 
FROM CTE
WHERE RN = 1;

Result (from the sample fiddle):

| Id | DisplayId | Version | Company |    Title |
|----|-----------|---------|---------|----------|
|  5 |     12345 |       1 |       2 |  Title 5 |
|  7 |     12345 |       2 |      16 |  Title 7 |
| 10 |      XX45 |       1 |       5 | Title 10 |
| 11 |      XX45 |       1 |       7 | Title 11 |
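
To see what the RN = 1 step does on the question's full sample data, here is a minimal sketch that can be run locally. It is not the fiddle above: it uses Python's built-in sqlite3 module (SQLite 3.25+ for window functions) as a stand-in for SQL Server, orders by Version instead of Id, and assumes every product has a ProductData row titled 'Title <Id>'.

```python
import sqlite3

# In-memory SQLite stand-in for the question's tables (the thread targets SQL Server).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (Id INTEGER PRIMARY KEY, DisplayId TEXT, Version INTEGER,
                       Company INTEGER, Description TEXT);
CREATE TABLE ProductData (ProductId INTEGER, Title TEXT);
INSERT INTO Products VALUES
  (1,'12345',0,16,'Random'),   (2,'12345',0,2,'Random 2'),  (3,'AB123',0,1,'Random 3'),
  (4,'12345',1,16,'Random 4'), (5,'12345',1,2,'Random 5'),  (6,'AB123',0,5,'Random 6'),
  (7,'12345',2,16,'Random 7'), (8,'XX45',0,5,'Random 8'),   (9,'XX45',0,7,'Random 9'),
  (10,'XX45',1,5,'Random 10'), (11,'XX45',1,7,'Random 11');
-- Assumption: one ProductData row per product, titled 'Title <Id>'.
INSERT INTO ProductData SELECT Id, 'Title ' || Id FROM Products;
""")

# Keep only the highest Version per (DisplayId, Company).
latest_ids = [row[0] for row in conn.execute("""
WITH CTE AS (
  SELECT p.Id, p.DisplayId, p.Version, p.Company, d.Title,
         ROW_NUMBER() OVER (PARTITION BY p.DisplayId, p.Company
                            ORDER BY p.Version DESC) AS RN
  FROM Products AS p
  INNER JOIN ProductData AS d ON d.ProductId = p.Id
)
SELECT Id FROM CTE WHERE RN = 1 ORDER BY Id
""")]

print(latest_ids)
```

Under these assumptions the AB123 rows (Ids 3 and 6) are still returned, so this step only picks the latest version per company; a further filter on DisplayId and Companies.Code is needed to keep only the duplicates.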
时光不老,我们不散
#4 · 2019-03-04 08:28

You first have to get the current version of each product, and then count how many times each DisplayID + Code combination shows up among those current rows. Based on that, you select only the combinations with a count greater than one. You can then INNER JOIN ProductData on the final query to get the Title.

WITH
MaxVersion AS --Get the current versions
(
    SELECT
        MAX(Version) AS Version,
        DisplayID,
        Company
    FROM
        #TmpProducts
    GROUP BY
        DisplayID,
        Company
)
,CTE AS
(
    SELECT
        p.DisplayID,
        c.Code,
        COUNT(*) AS RowCounter
    FROM
        #TmpProducts p
    INNER JOIN
        #TmpCompanies c
        ON
            c.ID = p.Company
    INNER JOIN
        MaxVersion mv
        ON
            mv.DisplayID = p.DisplayID
        AND mv.Version = p.Version
        AND mv.Company = p.Company
    GROUP BY
        p.DisplayID,
        c.Code
)

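SELECT 
    p.*
FROM
    #TmpProducts p
INNER JOIN
    #TmpCompanies pc --Needed so the match below can also compare Code
    ON
        pc.ID = p.Company
INNER JOIN
    CTE c
    ON
        c.DisplayID = p.DisplayID
    AND c.Code = pc.Code
INNER JOIN
    MaxVersion mv
    ON
        mv.DisplayID = p.DisplayID
    AND mv.Company = p.Company
    AND mv.Version = p.Version
WHERE
    c.RowCounter > 1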
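
The same three steps (current version, count per DisplayID + Code, final join including ProductData) can be tried end to end with the sketch below. It is only an illustration under assumptions: it runs on Python's built-in sqlite3 rather than SQL Server, uses plain tables instead of the #Tmp tables, assumes titles of the form 'Title <Id>', and adds one hypothetical row (Id 12) to show why the final match should compare Code as well as DisplayID.

```python
import sqlite3

# In-memory SQLite stand-in for the question's tables (the thread targets SQL Server).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Companies (Id INTEGER PRIMARY KEY, Code TEXT);
CREATE TABLE Products (Id INTEGER PRIMARY KEY, DisplayId TEXT, Version INTEGER,
                       Company INTEGER, Description TEXT);
CREATE TABLE ProductData (ProductId INTEGER, Title TEXT);
INSERT INTO Companies VALUES (1,'ABC'),(2,'456'),(5,'XYZ'),(7,'XYZ'),(16,'456');
INSERT INTO Products VALUES
  (1,'12345',0,16,'Random'),   (2,'12345',0,2,'Random 2'),  (3,'AB123',0,1,'Random 3'),
  (4,'12345',1,16,'Random 4'), (5,'12345',1,2,'Random 5'),  (6,'AB123',0,5,'Random 6'),
  (7,'12345',2,16,'Random 7'), (8,'XX45',0,5,'Random 8'),   (9,'XX45',0,7,'Random 9'),
  (10,'XX45',1,5,'Random 10'), (11,'XX45',1,7,'Random 11'),
  -- Hypothetical extra row (not in the question): XX45 under a company with another code.
  (12,'XX45',0,1,'Random 12');
-- Assumption: one ProductData row per product, titled 'Title <Id>'.
INSERT INTO ProductData SELECT Id, 'Title ' || Id FROM Products;
""")

rows = conn.execute("""
WITH MaxVersion AS (          -- step 1: current version per DisplayId and Company
  SELECT DisplayId, Company, MAX(Version) AS Version
  FROM Products
  GROUP BY DisplayId, Company
),
LatestRows AS (               -- current rows together with their company code
  SELECT p.Id, p.DisplayId, p.Version, p.Company, c.Code
  FROM Products p
  INNER JOIN MaxVersion mv
    ON mv.DisplayId = p.DisplayId AND mv.Company = p.Company AND mv.Version = p.Version
  INNER JOIN Companies c ON c.Id = p.Company
),
Dupes AS (                    -- step 2: DisplayId + Code seen more than once
  SELECT DisplayId, Code
  FROM LatestRows
  GROUP BY DisplayId, Code
  HAVING COUNT(*) > 1
)
SELECT l.Id, l.DisplayId, l.Version, l.Company, d.Title   -- step 3: final join
FROM LatestRows l
INNER JOIN Dupes x ON x.DisplayId = l.DisplayId AND x.Code = l.Code
INNER JOIN ProductData d ON d.ProductId = l.Id
ORDER BY l.Id
""").fetchall()

for row in rows:
    print(row)
```

With this data the query returns Ids 5, 7, 10 and 11. The hypothetical row 12 is left out because its company code is 'ABC'; matching on DisplayID alone would have let it through.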
霸刀☆藐视天下
#5 · 2019-03-04 08:31

Try this:

SELECT b.ID, b.displayid, b.version, b.company, productdata.title
FROM
(select a.ID, a.displayid, a.version, a.company, a.rn, a.code,
        COUNT(a.displayid) over (partition by a.displayid, a.code) cnt
 from
 (select Products.ID, displayid, version, company, Companies.code,
         Row_number() over (partition by displayid, company order by version desc) rn
  from Products inner join Companies on Products.Company = Companies.id) a
 where a.rn = 1) b
inner join productdata on b.id = productdata.ProductId
where cnt > 1
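
A note on the final count filter: testing for exactly 2 only works while at most two companies share a code. The sketch below compares "cnt = 2" with "cnt > 1" and is only an illustration under assumptions: it runs on Python's built-in sqlite3 (SQLite 3.25+ for window functions) rather than SQL Server, and it adds a hypothetical third 'XYZ' company (Id 8) with a product (Id 12) that is not in the question. ProductData is left out because it does not affect the filter.

```python
import sqlite3

# In-memory SQLite stand-in for the question's tables (the thread targets SQL Server).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Companies (Id INTEGER PRIMARY KEY, Code TEXT);
CREATE TABLE Products (Id INTEGER PRIMARY KEY, DisplayId TEXT, Version INTEGER,
                       Company INTEGER, Description TEXT);
INSERT INTO Companies VALUES (1,'ABC'),(2,'456'),(5,'XYZ'),(7,'XYZ'),(16,'456'),
  (8,'XYZ');                    -- hypothetical third company sharing code 'XYZ'
INSERT INTO Products VALUES
  (1,'12345',0,16,'Random'),   (2,'12345',0,2,'Random 2'),  (3,'AB123',0,1,'Random 3'),
  (4,'12345',1,16,'Random 4'), (5,'12345',1,2,'Random 5'),  (6,'AB123',0,5,'Random 6'),
  (7,'12345',2,16,'Random 7'), (8,'XX45',0,5,'Random 8'),   (9,'XX45',0,7,'Random 9'),
  (10,'XX45',1,5,'Random 10'), (11,'XX45',1,7,'Random 11'),
  (12,'XX45',0,8,'Random 12');  -- hypothetical product for that company
""")

QUERY = """
SELECT b.Id
FROM (
  SELECT a.Id, COUNT(*) OVER (PARTITION BY a.DisplayId, a.Code) AS cnt
  FROM (
    SELECT p.Id, p.DisplayId, c.Code,
           ROW_NUMBER() OVER (PARTITION BY p.DisplayId, p.Company
                              ORDER BY p.Version DESC) AS rn
    FROM Products p
    INNER JOIN Companies c ON c.Id = p.Company
  ) a
  WHERE a.rn = 1
) b
WHERE {condition}
ORDER BY b.Id
"""

def duplicate_ids(condition):
    """Run the query with a fixed filter on cnt and return the matching Ids."""
    return [row[0] for row in conn.execute(QUERY.format(condition=condition))]

exactly_two = duplicate_ids("b.cnt = 2")
more_than_one = duplicate_ids("b.cnt > 1")
print(exactly_two, more_than_one)
```

With the hypothetical third company, the "= 2" filter drops the whole XX45 group (its count is 3) and returns only Ids 5 and 7, while "> 1" returns 5, 7, 10, 11 and 12.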