I need to find the median of multiple columns of double datatype and would appreciate suggestions on the correct approach.
Below is my sample dataset with one column. I expect the median to be returned as 1 for this sample.
scala> sqlContext.sql("select num from test").show();
+---+
|num|
+---+
|0.0|
|0.0|
|1.0|
|1.0|
|1.0|
|1.0|
+---+
I tried the following options:
1) Hive UDAF percentile, which worked only for BigInt.
2) Hive UDAF percentile_approx, which does not work as expected (returns 0.25 instead of 1).
sqlContext.sql("select percentile_approx(num,0.5) from test").show();
+----+
| _c0|
+----+
|0.25|
+----+
3) Spark window function percent_rank. The way I see it, to find the median I would look at all rows with percent_rank above 0.5 and pick the num value corresponding to the maximum percent_rank. But this does not work in all cases, especially when the record count is even, where the median is the average of the two middle values in the sorted distribution.
Also, because I have to find the median for multiple columns, with percent_rank I have to calculate it in a separate dataframe for each column, which seems a somewhat complex method to me. Please correct me if my understanding is wrong.
+---+------------+
|num|percent_rank|
+---+------------+
|0.0|         0.0|
|0.0|         0.0|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
+---+------------+
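To make the percent_rank discussion concrete, here is a minimal plain-Python sketch (not Spark code) that reproduces the percent_rank column above, assuming the usual definition (rank - 1) / (n - 1) with ties sharing the lowest rank. The helper name percent_rank and the second list even_case are hypothetical, introduced only for illustration. It also shows why an even record count is awkward: the true median can be a value that appears in no row, so no single row's percent_rank can select it.

```python
from statistics import median


def percent_rank(values):
    """Return (value, percent_rank) pairs using (rank - 1) / (n - 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n <= 1:
        # A single row has percent_rank 0.0; an empty input yields no rows.
        return [(v, 0.0) for v in ordered]
    first_rank = {}
    for position, value in enumerate(ordered, start=1):
        # Tied values share the lowest (first) rank.
        first_rank.setdefault(value, position)
    return [(v, (first_rank[v] - 1) / (n - 1)) for v in ordered]


# The sample column from the question.
sample = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
ranked = percent_rank(sample)
for value, pr in ranked:
    print(value, pr)

# Hypothetical even-count column: the median is the average of the
# two middle values and is not one of the rows.
even_case = [1.0, 2.0, 3.0, 4.0]
print(median(sample), median(even_case))
```

For the sample data the two middle values are both 1.0, so the median is 1.0. For even_case it is 2.5, which no row contains.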
Out of curiosity, which version of Apache Spark are you using? Apache Spark 2.0+ included some fixes, among them changes to approxQuantile. If I were to run the PySpark code snippet below:
with the median calculation using approxQuantile as:
or
the results are:
Note that these are approximate numbers (via approxQuantile), though in general this should work well. If you need the exact median, one approach is to use numpy.median. The code snippet below is updated for this df example based on gench's SO response to "How to find the median in Apache Spark with Python Dataframe API?":
with the output of:
Updated: Spark 1.6 Scala version using RDDs
If you are using Spark 1.6, you can calculate the median using Scala code via Eugene Zhulenev's response to "How can I calculate the exact median with Apache Spark". Below is the modified code that works with our example.
with the output of:
Note that this calculates the exact median using RDDs, i.e. you will need to convert the DataFrame column into an RDD to perform this calculation.
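As a rough illustration of the idea behind an exact median (sort the values, key each one by its sorted position, then look up the middle position or the two middle positions), here is a plain-Python sketch. It is not the Scala RDD code referred to above, only an analog of the same steps on in-memory lists. The function name exact_median and the "other" column are hypothetical, and applying the function once per column is one possible way to handle several columns.

```python
def exact_median(values):
    """Exact median via sort -> index by position -> look up the middle."""
    # Key each sorted value by its position, similar in spirit to sorting
    # a column and indexing the rows so the middle can be looked up.
    indexed = dict(enumerate(sorted(values)))
    n = len(indexed)
    if n == 0:
        raise ValueError("median of an empty column is undefined")
    if n % 2 == 1:
        # Odd count: the single middle element.
        return indexed[n // 2]
    # Even count: average of the two middle elements.
    return (indexed[n // 2 - 1] + indexed[n // 2]) / 2.0


# "num" is the sample column from the question; "other" is a made-up
# second column to show the per-column application.
columns = {
    "num": [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "other": [3.5, 1.5, 2.5],
}
medians = {name: exact_median(vals) for name, vals in columns.items()}
print(medians)
```

Because every value has to be sorted, this is exact but more expensive than an approximate quantile on large data.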