可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count? Is the difference the same as between select count(distinct member_id) and select distinct count(member_id)?

回答1:

df.agg(countDistinct("member_id") as "count")

returns the number of distinct values of the member_id column, ignoring all other columns, while

df.distinct.count

will count the number of distinct records in the DataFrame - where "distinct" means identical in values of all columns.

So, for example, the DataFrame:

+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+

has only one distinct member_id value but two distinct records, so the agg option would return 1 while the latter would return 2.

回答2:

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?

Because .distinct.count is the same:

SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)

while ..agg(countDistinct("member_id") as "count") is

SELECT COUNT(DISTINCT member_id) FROM table

and COUNT(*) uses different rules than COUNT(column) when nulls are encountered.

回答3:

1st command :

DF.agg(countDistinct("member_id") as "count")

return the same as that of select count distinct(member_id) from DF.

2nd command :

DF.distinct.count

is actually getting distinct records or removing al duplicates from the DF and then taking the count.

The difference between countDistinct and distinct.

问题:

回答1:

回答2:

回答3:

收藏的人(0)

The difference between countDistinct and distinct.

问题:

回答1:

回答2:

回答3:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮