COUNT (DISTINCT column_name) Discrepancy vs. COUNT

2020-05-21 06:41发布

问题:

I'm running into a problem that's driving me nuts. When running the query below, I get a count of 233,769

 SELECT COUNT(distinct  Member_List_Link.UserID)  
 FROM Member_List_Link  with (nolock)   
 INNER JOIN MasterMembers with (nolock)  
     ON Member_List_Link.UserID = MasterMembers.UserID   
  WHERE MasterMembers.Active = 1 And
        Member_List_Link.GroupID = 5 AND 
        MasterMembers.ValidUsers = 1 AND 
        Member_List_Link.Status = 1

But if I run the same query without the distinct keyword, I get a count of 233,748

 SELECT COUNT(Member_List_Link.UserID)  
 FROM Member_List_Link  with (nolock)   
 INNER JOIN MasterMembers with (nolock)
   ON Member_List_Link.UserID = MasterMembers.UserID   
 WHERE MasterMembers.Active = 1 And Member_List_Link.GroupID = 5 
  AND MasterMembers.ValidUsers = 1 AND Member_List_Link.Status = 1

To test, I recreated all the tables and place them into temp tables and ran the queries again:

  SELECT COUNT(distinct  #Temp_Member_List_Link.UserID)  
  FROM #Temp_Member_List_Link  with (nolock)   
  INNER JOIN #Temp_MasterMembers with (nolock)
    ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID   
  WHERE #Temp_MasterMembers.Active = 1 And 
        #Temp_Member_List_Link.GroupID = 5 AND 
        #Temp_MasterMembers.ValidUsers = 1 AND 
        #Temp_Member_List_Link.Status = 1

And without the distinct keyword

  SELECT COUNT(#Temp_Member_List_Link.UserID)  
  FROM #Temp_Member_List_Link  with (nolock)   
  INNER JOIN #Temp_MasterMembers with (nolock)
    ON #Temp_Member_List_Link.UserID = #Temp_MasterMembers.UserID   
  WHERE #Temp_MasterMembers.Active = 1 And 
        #Temp_Member_List_Link.GroupID = 5 AND 
        #Temp_MasterMembers.ValidUsers = 1 AND 
        #Temp_Member_List_Link.Status = 1

On a side note, I recreated the temp tables by simply running (select * from Member_List_Link into #temp...)

And now when I check to see the difference between COUNT(column) vs. COUNT(distinct column) with these temp tables, I don't see any!

So why is there a discrepancy with the original tables?

I'm running SQL Server 2008 (Dev Edition).

UPDATE - Including statistics profile

PhysicalOp column only for the first query (without distinct)

NULL
Compute Scalar
Stream Aggregate
Clustered Index Seek

PhysicalOp column only for the first query (with distinct)

NULL
Compute Scalar
Stream Aggregate
Parallelism
Stream Aggregate
Hash Match
Hash Match
Bitmap
Parallelism
Index Seek
Parallelism
Clustered Index Scan

Rows and Executes for the 1st query (without distinct)

1   1
0   0
1   1
1   1

Rows and Executes for the 2nd query (with distinct)

Rows    Executes
1   1
0   0
1   1
16  1
16  16
233767  16
233767  16
281901  16
281901  16
281901  16
234787  16
234787  16

Adding OPTION(MAXDOP 1) to the 2nd query (with distinct)

Rows Executes

1           1
0           0
1           1
233767          1
233767          1
281901          1
548396          1

And the resulting PhysicalOp

NULL
Compute Scalar
Stream Aggregate
Hash Match
Hash Match
Index Seek
Clustered Index Scan

回答1:

FROM http://msdn.microsoft.com/en-us/library/ms187373.aspx NOLOCK Is equivalent to READUNCOMMITTED. For more information, see READUNCOMMITTED later in this topic.

READUNCOMMITED will read rows twice if they are the subject of a transation- since both the roll foward and roll back rows exist within the database when the transaction is IN process.

By default all queries are read committed which excludes uncommitted rows

When you insert into a temp table the select will give you only committed rows - I believe this covers all the symptoms you are trying to explain



回答2:

I think i have got the answer to your question but tell me first is userid a primary key in your original table ?

if yes,then CTAS query to create temp table would not copy any primary key of original table ,it only copy NOT NULL constraint that is not a part of primary key..fine?

now what happened your original table had a primary key so count(distinct column_name) doesnt include tuples with null records and while you created temp tables , primary key doesnt get copied and hence the NOT NULL constraint doesnt get to the temp table!!

is that clear to you?



回答3:

It's hard to reproduce this behaviour, so I'm punching in the dark here:

The WITH (NOLOCK) statement enables reading of uncommitted data. I'm guessing you've added that to not lock anything for your users? If you remove those and issue a

SET TRANSACTION ISOLATION LEVEL READ COMMITTED

Prior to executing the query, you should get more reliable results. But then, the tables may receive locks while executing the query.

If that doesn't work, my guess is that DISTINCT use an index to optimize. Check the queryplan, and rebuild indexes as necessary. Could be the source of your problem.



回答4:

What result do you get with

SELECT count(*) FROM (
    SELECT distinct  Member_List_Link.UserID
    FROM Member_List_Link  with (nolock)
    INNER JOIN MasterMembers with (nolock)
      ON Member_List_Link.UserID = MasterMembers.UserID
    WHERE MasterMembers.Active = 1 And
         Member_List_Link.GroupID = 5 AND 
         MasterMembers.ValidUsers = 1 AND
         Member_List_Link.Status = 1
) as m

AND WITH:

SELECT count(*) FROM (
    SELECT distinct  Member_List_Link.UserID
    FROM Member_List_Link  
    INNER JOIN MasterMembers
      ON Member_List_Link.UserID = MasterMembers.UserID
    WHERE MasterMembers.Active = 1 And
         Member_List_Link.GroupID = 5 AND 
         MasterMembers.ValidUsers = 1 AND
         Member_List_Link.Status = 1
) as m


回答5:

Ray, please try the following

SELECT COUNT(*)
FROM 
(
    SELECT Member_List_Link.UserID, ROW_NUMBER() OVER (PARTITION BY Member_List_Link.UserID ORDER BY (SELECT NULL)) N
    FROM Member_List_Link  with (nolock)   
    INNER JOIN MasterMembers with (nolock)  
        ON Member_List_Link.UserID = MasterMembers.UserID   
     WHERE MasterMembers.Active = 1 And
           Member_List_Link.GroupID = 5 AND 
           MasterMembers.ValidUsers = 1 AND 
           Member_List_Link.Status = 1
) A
WHERE N = 1


回答6:

when you use count with distinct column it doesn't count columns having values null.

create table #tmp(name char(4) null)

insert into #tmp values(null)

insert into #tmp values(null)

insert into #tmp values("AAA")

Query:- 1> select count(*) from #tmp 2> go


       3

1> select count(distinct name) from #tmp 2> go


       1

1> select distinct name from #tmp 2> go name


NULL

AAA

but it works in derived table

1> select count(*) from ( select distinct name from #tmp) a

2> go


       2

Note:- I tested it in Sybase