I have a dataset with missing values, and I would like to get the number of missing values for each column. Here is what I did; it gives me the number of non-missing values. How can I use it to get the number of missing values?
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary| x| y| z|
+-------+---+---+---+
| count| 1| 2| 3|
+-------+---+---+---+
Any help getting a DataFrame that lists each column and its number of missing values would be appreciated.
Many thanks.
You could count the missing values by summing the boolean output of the isNull() method, after casting it to integer:
In Scala:
import org.apache.spark.sql.functions.{sum, col}

// Cast each null check to 0/1, then sum per column
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col, sum

# Cast each null check to 0/1, then sum per column
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
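For illustration, here is a minimal, self-contained sketch of the Python version. The toy DataFrame (and the SparkSession setup) is hypothetical, chosen so the non-null counts match the 1/2/3 in your describe() output:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()

# Hypothetical 3-row DataFrame: x has 2 nulls, y has 1, z has none
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (None, 5.0, 6.0), (None, None, 9.0)],
    ["x", "y", "z"],
)

df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
# +---+---+---+
# |  x|  y|  z|
# +---+---+---+
# |  2|  1|  0|
# +---+---+---+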
Alternatively, you could use the output of df.describe().filter($"summary" === "count") and subtract each count from the total number of rows in the data:
In Scala:
import org.apache.spark.sql.functions.{lit, col}
val rows = df.count()
val summary = df.describe().filter($"summary" === "count")
summary.select(df.columns.map(c =>(lit(rows) - col(c)).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")
summary.select(*((lit(rows) - col(c)).alias(c) for c in df.columns)).show()
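One caveat: describe() returns all of its columns as strings, so the subtraction above relies on Spark's implicit string-to-number coercion. A hedged variant with an explicit cast (reusing the hypothetical df from the sketch above) keeps the results integer-typed:

from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")

# describe() yields string-typed columns; cast explicitly before subtracting
summary.select(*((lit(rows) - col(c).cast("long")).alias(c) for c in df.columns)).show()
# +---+---+---+
# |  x|  y|  z|
# +---+---+---+
# |  2|  1|  0|
# +---+---+---+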