Spark DataFrame grouping does not count nulls

Posted 2020-02-02 03:16

I have a spark DataFrame which is grouped by a column aggregated with a count:

df.groupBy("a").agg(count("a")).show

+---------+----------------+
|a        |count(a)        |
+---------+----------------+
|     null|               0|
|      -90|           45684|
+---------+----------------+


df.select('a).filter('a.isNull).count

returns

warning: there was one feature warning; re-run with -feature for details
res9: Long = 26834

which clearly shows that the null values were not counted initially.

What is the reason for this behaviour? I would have expected (if null is contained in the grouping result at all) to see the counts properly.

2 Answers
Viruses.
Answered 2020-02-02 03:38

What is the reason for this behaviour?

SQL-92 standard. In particular (emphasis mine):

Let T be the argument or argument source of a <set function specification>.

If COUNT(*) is specified, then the result is the cardinality of T.

Otherwise, let TX be the single-column table that is the result of applying the <value expression> to each row of T and eliminating null values.

If DISTINCT is specified, then let TXA be the result of eliminating redundant duplicate values from TX. Otherwise, let TXA be TX.

If COUNT is specified, then the result is the cardinality of TXA.
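The SQL-92 rule above can be modelled in a few lines of plain Scala, with no Spark required; this is only an illustrative sketch where a `Seq[Option[Int]]` stands in for the column and `None` plays the role of SQL NULL:

```scala
// The column T, with None standing in for SQL NULL.
val t: Seq[Option[Int]] = Seq(Some(-90), None, Some(-90), None, None)

// COUNT(*): the cardinality of T, nulls included.
val countStar = t.size

// COUNT(a): apply the expression to each row and eliminate nulls (TX),
// then take the cardinality.
val tx = t.flatten
val countA = tx.size

// COUNT(DISTINCT a): additionally eliminate redundant duplicates (TXA).
val countDistinctA = tx.distinct.size

println(countStar)      // 5
println(countA)         // 2
println(countDistinctA) // 1
```

This mirrors the standard exactly: null elimination happens before the cardinality is taken, so `COUNT(a)` on an all-null group yields 0, never null.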

倾城 Initia
Answered 2020-02-02 03:44

Yes, count applied to a specific column does not count null values. If you want to include the null values, use:

df.groupBy('a).agg(count("*")).show
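To see why `count("*")` behaves differently per group, here is a plain-Scala sketch (no Spark, purely illustrative) of grouping a nullable key and applying both counting rules, using the same data shape as the question:

```scala
// Plain-Scala sketch: group a nullable column by value and compare
// COUNT(*) with COUNT(a) within each group.
val rows: Seq[Option[Int]] = Seq(Some(-90), Some(-90), None, None, None)

val grouped = rows.groupBy(identity)

// COUNT(*): every row in the group counts, the null group included.
val perGroupStar = grouped.map { case (k, vs) => k -> vs.size }

// COUNT(a): nulls are eliminated before counting, so the null group
// reports 0 -- exactly the behaviour shown in the question.
val perGroupA = grouped.map { case (k, vs) => k -> vs.flatten.size }

println(perGroupStar) // Map(Some(-90) -> 2, None -> 3)
println(perGroupA)    // Map(Some(-90) -> 2, None -> 0)
```

So with `count("*")` the null row in the output table would show its true group size instead of 0.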