The difference between countDistinct and distinct.

2019-02-27 22:53发布

问题:

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count? Is the difference the same as between select count(distinct member_id) and select distinct count(member_id)?

回答1:

df.agg(countDistinct("member_id") as "count")

returns the number of distinct values of the member_id column, ignoring all other columns, while

df.distinct.count

will count the number of distinct records in the DataFrame - where "distinct" means identical in values of all columns.

So, for example, the DataFrame:

+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+

has only one distinct member_id value but two distinct records, so the agg option would return 1 while the latter would return 2.



回答2:

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?

Because .distinct.count is the same:

SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)

while ..agg(countDistinct("member_id") as "count") is

SELECT COUNT(DISTINCT member_id) FROM table

and COUNT(*) uses different rules than COUNT(column) when nulls are encountered.



回答3:

1st command :

DF.agg(countDistinct("member_id") as "count")

return the same as that of select count distinct(member_id) from DF.

2nd command :

DF.distinct.count

is actually getting distinct records or removing al duplicates from the DF and then taking the count.