Assuming the existence of an RDD of tuples similar to the following:
(key1, 1)
(key3, 9)
(key2, 3)
(key1, 4)
(key1, 5)
(key3, 2)
(key2, 7)
...
What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment, I am looking to calculate standard deviation / variance, in particular.) As I understand it, my options amount to:
- Use the
colStats
function in MLLib: This approach has the advantage of easily-adaptable to use othermllib.stat
functions later, if other statistical computations are deemed necessary. However, it operates on an RDD ofVector
containing the data for each column, so as I understand it, this approach would require that the full set of values for each key be collected on a single node, which would seem non-ideal for large data sets. Does a SparkVector
always imply that the data in theVector
be resident locally, on a single node? - Perform a
groupByKey
, thenstats
: Likely shuffle-heavy, as a result of thegroupByKey
operation. - Perform
aggregateByKey
, initializing a newStatCounter
, and usingStatCounter::merge
as the sequence and combiner functions: This is the approach recommended by this StackOverflow answer, and avoids thegroupByKey
from option 2. However, I haven't been able to find good documentation forStatCounter
in PySpark.
I like Option 1 because it makes the code more extensible, in that it could easily accommodate more complicated calculations using other MLLib functions with similar contracts, but if the Vector
inputs inherently require that the data sets be collected locally, then it limits the data sizes on which the code can effectively operate. Between the other two, Option 3 looks more efficient because it avoids the groupByKey
, but I was hoping to confirm that that is the case.
Are there any other options I haven't considered? (I am currently using Python + PySpark, but I'm open to solutions in Java/Scala as well, if there is a language difference.)