SQL: Why is distinct and max not removing duplicat

2019-09-08 01:31发布

问题:

SHouldn't the following query remove duplicates:

SELECT DISTINCT Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType
FROM            DimAccount AS ACC RIGHT OUTER JOIN
                             (SELECT DISTINCT PropertyID, MAX(TenancyStartDate) AS Tenancystart
                               FROM            DimAccount
                               WHERE        (AccountStatus = 'Current')
                               GROUP BY PropertyID, TenancyStartDate) AS Relevant ON ACC.PropertyID = Relevant.PropertyID AND ACC.TenancyStartDate = Relevant.Tenancystart
GROUP BY Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType, ACC.TenancyType

From my understanding (and what I want to happen) is, the query in brackets is selecting the property ID and of the ones with a status of current returning the highest tenancy start date (albeit several times). This is then joined to the original table by start date and property id, to get the most recent tenancytype.

Why is it still returning duplicate lines!?

(by the way this is relating to another question I had yesterday, but apparently replies are not supposed to descend into conversation so I thought I'd seperate this off... I hope that is the right thing to do... I have searched but clearly there is something missing in my understanding of something!)

回答1:

First, you almost never need select distinct when using group by.

The problem with your query is the group by clause in the subquery.

SELECT Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType
FROM DimAccount ACC RIGHT OUTER JOIN
     (SELECT PropertyID, MAX(TenancyStartDate) AS Tenancystart
      FROM  DimAccount
      WHERE (AccountStatus = 'Current')
      GROUP BY PropertyID
    ) Relevant
     ON ACC.PropertyID = Relevant.PropertyID AND
        ACC.TenancyStartDate = Relevant.Tenancystart
GROUP BY Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType;

It should not have TenancyStartDate. Also, your outer query had ACC.TenancyType twice in the group by.

That said, it is easier to write the query using analytic functions:

select a.*
from (select a.*,
             max(tenancystartdate) over (partition by propertyid) as max_tsd
      from dimaccount a
      where accountstatus = 'Current'
     ) a
where tenancystartdate = max_tsd;

This is not exactly the same as your query, because your query will take non-current records into account. I am guessing that this might be the intention, however.



回答2:

To answer your question: Yes, you are right, there can be no duplicates. And I am pretty sure there are none. I am also pretty sure that your query does not what you think it does.

This is your derived table:

SELECT DISTINCT PropertyID, MAX(TenancyStartDate) AS Tenancystart
FROM            DimAccount
WHERE        (AccountStatus = 'Current')
GROUP BY PropertyID, TenancyStartDate

As you group by PropertyID and TenancyStartDate, you get one line per PropertyID and TenancyStartDate. For each such line you want the MAX(TenancyStartDate), which is the TenancyStartDate itself of course. There is no other field you aggregate, so you don't aggregate at all, but only make the rows distinct, for which one would use DISTINCT. Then you do use DISTINCT to get unique result records, but your records are already unique, by your obfuscated way of doing it. So you say: select the distinct records of distinct records. Your subquery can be re-written as:

SELECT DISTINCT PropertyID, TenancyStartDate
FROM DimAccount
WHERE AccountStatus = 'Current'

Then you outer-join the DimAccount table. So you would keep your found records, even in case there is no matching DimAccount record. But: You've selected from DimAccount, so of course there is always at least the one record you already found. Your outer join is actually an inner join. Then the only field from the derived query shown is PropertyID which always equals ACC.PropertyID. This means: You are only selecting records from ACC and the derived table is just to make sure a 'Current' record exists for PropertyID and TenancyStartDate. Your query could thus be re-written as:

SELECT DISTINCT 
  PropertyID, TenancyStartDate, AccountID, TenancyType
FROM DimAccount AS ACC 
WHERE EXISTS
(
  SELECT *
  FROM DimAccount CurrentAccount
  WHERE CurrentAccount.AccountStatus = 'Current'
  AND CurrentAccount.PropertyID = ACC.PropertyID
  AND CurrentAccount.TenancyStartDate = ACC.TenancyStartDate
);

In case PropertyID + TenancyStartDate + AccountID + TenancyType are unique (is AccountID the table's ID?) then you can even remove DISTINCT.

This query gets all 'Current' DimAccount records first and then gives you all records with the same PropertyID and TenancyStartDate. However, from your explanation it seems you want to select the latest 'Current' DimAccount record per PropertyID. This is something entirely else. There are different solutions to such a task depending on the dbms you are using (you haven't specified yours in your tags).