import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', 'timestamp1', 'id2'))
Expected output: a dataframe with the count of NaN/null values for each column.
Note:
The previous questions I found on Stack Overflow only check for null and not NaN. That's why I have created a new question.
I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark DataFrame?
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
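If you want the counts back on the driver as a plain Python dict rather than a displayed DataFrame, you can collect the single result row. A minimal sketch, using the same df and the same expression as above:

from pyspark.sql.functions import isnan, when, count, col

# The aggregation produces exactly one row, so first().asDict() gives
# {column name: count of NaN/null values} for every column.
null_counts = df.select(
    [count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()
# {'session': 0, 'timestamp1': 0, 'id2': 5}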
You can create a UDF to check for both null and NaN and return a boolean value to use in a filter. The code is Scala; hopefully you can convert it to Python (a rough PySpark sketch follows the snippet).
// Use java.lang.Float so the UDF can actually receive nulls
// (a Scala primitive Float can never be null).
val isNaN = udf((value: java.lang.Float) => value == null || value.isNaN)
val result = data.filter(isNaN(data("column2"))).count()
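A possible PySpark translation of the same idea (a sketch, assuming your DataFrame is called data and the column is column2, as in the Scala snippet):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
import math

# True when the value is null or NaN, so it can be used directly in filter().
is_nan_or_null = udf(lambda value: value is None or math.isnan(value), BooleanType())
result = data.filter(is_nan_or_null(data["column2"])).count()

That said, the built-in isnan and isNull functions shown in the other answer avoid the overhead of a Python UDF.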
Hope this helps!