edf.select("x").distinct.show()
shows the distinct values present in the x column of the edf DataFrame.
Is there an efficient method to also show how many times each of these distinct values occurs in the DataFrame (a count per distinct value)?
countDistinct is probably the first choice:
import org.apache.spark.sql.functions.countDistinct
df.agg(countDistinct("some_column"))
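Note that countDistinct returns the number of distinct values, not a count per value. As a minimal sketch of what it computes, the same operation on a plain Scala collection (with made-up sample values) is just distinct followed by size:

```scala
// Plain-Scala illustration of what countDistinct computes:
// the number of distinct values in a column, not per-value counts.
val column = Seq("a", "b", "a", "c", "b", "a") // hypothetical column values
val distinctCount = column.distinct.size
println(distinctCount) // 3
```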
If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):
import org.apache.spark.sql.functions.approx_count_distinct
df.agg(approx_count_distinct("some_column"))
To get values and counts:
df.groupBy("some_column").count()
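For intuition, the groupBy-then-count pattern corresponds to the following on a plain Scala collection (sample values are made up for illustration):

```scala
// Plain-Scala analogue of df.groupBy("some_column").count():
// one (value, count) pair per distinct value.
val column = Seq("a", "b", "a", "c", "b", "a") // hypothetical column values
val valueCounts = column.groupBy(identity).map { case (v, occ) => v -> occ.size }
println(valueCounts) // a -> 3, b -> 2, c -> 1 (iteration order may vary)
```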
In SQL (spark-sql):
SELECT COUNT(DISTINCT some_column) FROM df
and
SELECT approx_count_distinct(some_column) FROM df
Another option, without resorting to SQL functions:
df.groupBy("your_column_name").count().show()
show will print the distinct values together with their number of occurrences. Without show, the result is a DataFrame.
To count the distinct values of one column within each group of another:
import org.apache.spark.sql.functions.countDistinct
df.groupBy("a").agg(countDistinct("s")).collect()
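For reference, that grouped distinct count corresponds to the following on plain Scala pairs (column names "a" and "s" come from the snippet above; the sample rows are made up):

```scala
// Plain-Scala analogue of df.groupBy("a").agg(countDistinct("s")):
// for each value of a, count the distinct values of s.
val rows = Seq(("a1", "s1"), ("a1", "s2"), ("a1", "s1"), ("a2", "s1"))
val distinctPerGroup = rows.groupBy(_._1).map { case (a, rs) => a -> rs.map(_._2).distinct.size }
println(distinctPerGroup) // a1 -> 2, a2 -> 1
```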
df.select("some_column").distinct.count // number of distinct values, as a Long
If you are using Java, the import org.apache.spark.sql.functions.countDistinct; will give an error:
The import org.apache.spark.sql.functions.countDistinct cannot be resolved
To use countDistinct in Java, use the format below:
import org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;
df.agg(functions.countDistinct("some_column"));