Can someone explain the computation of median/quantiles in MapReduce?
My understanding of DataFu's median is that the 'n' mappers sort the data and send it to a single reducer, which is responsible for sorting all the data from the n mappers and finding the median (middle value). Is my understanding correct?
If so, does this approach scale for massive amounts of data? I can clearly see the single reducer struggling to do the final task. Thanks
In many real-world scenarios, the cardinality of values in a dataset will be relatively small. In such cases, the problem can be efficiently solved with two MapReduce jobs:
Job 1 will drastically reduce the amount of data and can be executed fully in parallel. The reducer of job 2 will only have to process n (n = cardinality of your value set) items instead of all values, as with the naive approach.
Below is an example reducer for job 2. It is a Python script that could be used directly in Hadoop streaming. It assumes the values in your dataset are ints, but it can easily be adapted for doubles.
This answer builds on a suggestion that originally came from Chris White's answer. That answer suggests using a combiner as a means to calculate frequencies of values. However, in MapReduce, combiners are not guaranteed to always be executed. This has some side effects:
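A minimal sketch of what a job 2 reducer along these lines might look like (this is not the original script). It assumes each input line is a tab-separated `value<TAB>frequency` pair and that all pairs fit in memory, which holds when the cardinality is small. Because a combiner may or may not have run, the same value can arrive more than once, so the counts are summed first. The function names are hypothetical.

```python
import sys


def parse_pairs(lines):
    """Turn 'value<TAB>frequency' lines into {value: total frequency}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value, freq = line.split("\t")
        # The same value may appear several times; add the frequencies up.
        counts[int(value)] = counts.get(int(value), 0) + int(freq)
    return counts


def median_from_counts(counts):
    """Walk the values in ascending order until the middle rank(s) are reached."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no input values")
    lo_rank = (total - 1) // 2  # 0-based rank of the lower middle element
    hi_rank = total // 2        # 0-based rank of the upper middle element
    lo = hi = None
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if lo is None and seen > lo_rank:
            lo = value
        if seen > hi_rank:
            hi = value
            break
    # For an odd total, lo == hi; for an even total, average the two.
    return (lo + hi) / 2.0


def main(stream=sys.stdin):
    """Entry point for use as a streaming reducer: read pairs, print the median."""
    print(median_from_counts(parse_pairs(stream)))
```

To use it as a streaming reducer, call `main()` at the bottom of the script. Sorting happens inside the reducer, so it does not rely on the framework delivering keys in numeric order.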
Do you really need the exact median and quantiles?
A lot of the time, you are better off with just getting approximate values, and working with them, in particular if you use this for e.g. data partitioning.
In fact, you can use the approximate quantiles to speed up finding the exact quantiles (actually in O(n/p) time). Here is a rough outline of the strategy:
... (in O(n)) to find the true quantile.
Each of the steps is in linear time. The most costly step is part 3, as it will require the whole data set to be redistributed, so it generates O(n) network traffic. You can probably optimize the process by choosing "alternate" quantiles for the first iteration. Say you want to find the global median. You can't find it in a linear process easily, but you can probably narrow it down to 1/kth of the data set when it is split into k partitions. So instead of having each node report its median, have each node additionally report the objects at (k-1)/(2k) and (k+1)/(2k). This should allow you to significantly narrow down the range of values where the true median must lie. In the next step, you can have each node send those objects that are within the desired range to a single master node, and choose the median within this range only.
O((n log n)/p) to sort it then O(1) to get the median.
Yes... you can get O(n/p) but you can't use the out of the box sort functionality in Hadoop. I would just sort and get the center item unless you can justify the 2-20 hours of development time to code the parallel kth largest algorithm.
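For comparison with the plain sort, the two-pass narrowing idea described earlier (each partition reports quantiles around its median, then only the values inside the resulting range are collected) could be sketched as a single-process simulation. This is an illustration under assumptions, not a Hadoop job: partitions are plain Python lists, the range is chosen conservatively as the minimum of the lower quantiles and the maximum of the upper quantiles, and the code falls back to a full sort if the target rank ever falls outside the range. All names are hypothetical.

```python
def local_quantile(sorted_part, num, den):
    """Element at fraction num/den of an already sorted, non-empty list."""
    idx = min(len(sorted_part) - 1, (num * len(sorted_part)) // den)
    return sorted_part[idx]


def narrowed_median(partitions):
    """Return the lower median of all values spread over the partitions."""
    parts = [sorted(p) for p in partitions if p]
    n = sum(len(p) for p in parts)
    if n == 0:
        raise ValueError("no input values")
    k = len(parts)
    target = (n - 1) // 2  # 0-based global rank of the lower median

    # Pass 1: every partition reports its (k-1)/(2k) and (k+1)/(2k) quantiles.
    lo = min(local_quantile(p, k - 1, 2 * k) for p in parts)
    hi = max(local_quantile(p, k + 1, 2 * k) for p in parts)

    # Pass 2: every partition reports how many values lie below the range
    # and ships only the values inside the range to the "master".
    below = sum(1 for p in parts for x in p if x < lo)
    candidates = sorted(x for p in parts for x in p if lo <= x <= hi)

    if below <= target < below + len(candidates):
        return candidates[target - below]
    # Safety net: the range missed the target rank, so sort everything.
    return sorted(x for p in parts for x in p)[target]
```

For example, `narrowed_median([[1, 9, 4], [7, 2], [8, 3, 6, 5]])` returns `5`. How much data the second pass saves depends on how similar the partitions are; with a conservative range like this one, the saving can be small.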
Trying to find the median (middle number) in a series is going to require that 1 reducer is passed the entire range of numbers to determine which is the 'middle' value.
Depending on the range and uniqueness of values in your input set, you could introduce a combiner to output the frequency of each value, reducing the number of map outputs sent to your single reducer. Your reducer can then consume the sorted value/frequency pairs to identify the median.
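As a rough illustration of the combiner side, the sketch below collapses repeated values into value/frequency pairs. It assumes each input line holds either a bare integer value or a `value<TAB>count` pair; the function name and the line format are assumptions made for this example.

```python
from collections import Counter


def combine(lines):
    """Collapse repeated values into sorted 'value<TAB>frequency' lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        value = int(fields[0])
        # A bare value counts once; an existing pair carries its own count.
        freq = int(fields[1]) if len(fields) > 1 else 1
        counts[value] += freq
    return ["%d\t%d" % (value, freq) for value, freq in sorted(counts.items())]
```

Because the function accepts its own output format as input, it stays correct whether the framework runs it zero, one, or several times, provided the reducer also sums frequencies per value.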
Another way you could scale this (again, if you know the range and rough distribution of values) is to use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 1, and so on). This will, however, require some secondary job to examine the reducer outputs and perform the final median calculation (knowing, for example, the number of keys in each reducer, you can calculate which reducer output will contain the median, and at which offset).
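The bookkeeping for the range-bucket approach can be sketched as follows. This is a plain Python simulation rather than a Hadoop partitioner; it assumes non-negative integer values, a fixed bucket width of 100, and the lower median as the target. The function names are hypothetical.

```python
def bucket_for(value, width=100):
    """Range partitioner: 0-99 -> bucket 0, 100-199 -> bucket 1, and so on."""
    return value // width


def locate_median(bucket_counts):
    """Given per-bucket counts in ascending bucket order, return
    (bucket index, offset inside that bucket) of the lower median."""
    total = sum(bucket_counts)
    if total == 0:
        raise ValueError("no input values")
    target = (total - 1) // 2  # 0-based global rank of the lower median
    seen = 0
    for index, count in enumerate(bucket_counts):
        if target < seen + count:
            return index, target - seen
        seen += count


def median_via_buckets(values, width=100):
    """Simulate the two steps: bucket the values, then read one bucket."""
    buckets = {}
    for v in values:
        buckets.setdefault(bucket_for(v, width), []).append(v)
    ids = sorted(buckets)
    index, offset = locate_median([len(buckets[b]) for b in ids])
    # Only the bucket holding the median needs to be sorted and inspected.
    return sorted(buckets[ids[index]])[offset]
```

With bucket counts `[40, 10, 50]`, `locate_median` returns `(1, 9)`: the median is the tenth value in the output of the second reducer. The approach only balances the load if the buckets match the actual distribution of values reasonably well.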