This page contains some statistics functions (mean, stdev, variance, etc.), but it does not contain the median. How can I calculate the exact median?
Thanks
You need to sort the RDD and take the element in the middle, or the average of the two middle elements when the count is even. Here is an example with RDD[Int]:
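The code from the original answer did not survive here, so the following is only a minimal sketch of the sort-and-pick-the-middle approach described above. The helper name exactMedian, the local SparkSession setup, and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper: exact median of an RDD[Int] via a global sort.
def exactMedian(rdd: RDD[Int]): Double = {
  // Sort the values, then key each one by its position in the sorted order.
  val indexed = rdd
    .sortBy(x => x)
    .zipWithIndex()
    .map { case (value, index) => (index, value) }
    .cache()
  val n = indexed.count()
  require(n > 0, "median of an empty RDD is undefined")
  if (n % 2 == 1) {
    // Odd count: take the single middle element.
    indexed.lookup(n / 2).head.toDouble
  } else {
    // Even count: average the two middle elements.
    (indexed.lookup(n / 2 - 1).head.toDouble + indexed.lookup(n / 2).head.toDouble) / 2.0
  }
}

// Assumed local session, only for trying the sketch out.
val spark = SparkSession.builder().master("local[*]").appName("median-sketch").getOrCreate()
val sc = spark.sparkContext

val oddMedian = exactMedian(sc.parallelize(Seq(5, 1, 3)))      // middle of 1, 3, 5
val evenMedian = exactMedian(sc.parallelize(Seq(4, 1, 3, 2)))  // average of 2 and 3
```

Note that this sorts the whole RDD, so it can be expensive on large data; the lookup calls are cheap only relative to the sort.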
Using Spark 2.0+ and the DataFrame API, you can use the approxQuantile method. Since Spark 2.2 it also works on multiple columns at the same time. By setting probabilities to Array(0.5) and relativeError to 0, it will compute the exact median. The documentation notes that when relativeError is set to zero, the exact quantiles are computed, which could be very expensive.

Despite this, there seem to be some issues with the precision when setting relativeError to 0, see the question here. A low error close to 0 will in some instances work better (this depends on the Spark version).

A small working example that calculates the median of the numbers from 1 to 99 (both inclusive) with a low relativeError returns a median of 50.0.
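The original snippet is missing from this answer, so here is a rough reconstruction of what such an example might look like. The column name "value", the relativeError of 0.001, and the local SparkSession setup are assumptions, not values taken from the answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session, only for trying the sketch out.
val spark = SparkSession.builder().master("local[*]").appName("approx-quantile-sketch").getOrCreate()

// Numbers 1 to 99 inclusive (the end of range is exclusive); "value" is an assumed column name.
val df = spark.range(1, 100).toDF("value")

// probabilities = Array(0.5) asks for the median.
// relativeError = 0.001 is an arbitrary low value chosen for illustration.
val median = df.stat.approxQuantile("value", Array(0.5), 0.001).head

println(median)
```

For several columns at once (Spark 2.2+), approxQuantile also accepts an Array[String] of column names and returns one array of quantiles per column.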