I'm in Spark 1.3.0 and my data is in DataFrames.
I need operations like sampleByKey(), sampleByKeyExact().
I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157).
That is targeted for Spark 1.5. Until it lands, what's the easiest way to get the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames?
Thanks & Regards
MK
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without MLlib dependencies.
Both are PairRDDFunctions methods and operate on key-value RDDs of type RDD[(K, T)]. DataFrames do not have keys, so you'd have to go through the underlying RDD, something like below:
val df = ... // your dataframe
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
val sample = df.rdd.keyBy(x=>x(0)).sampleByKey(false, fractions)
Note that sample is now an RDD of (key, Row) pairs, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df.
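The round trip described above (DataFrame to keyed RDD, sample, back to DataFrame) could be wrapped up roughly as follows. This is only a sketch under some assumptions: the helper name stratifiedSample and its parameters keyColumn, exact and seed are made up for illustration, the key is taken from a named column rather than position 0, and the keys are typed as Any because that is what Row access returns. As far as I can tell, sampleByKey looks up a fraction for every key it encounters, so the map should contain an entry for each key present in the data.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

// Hypothetical helper (not part of Spark): stratified sample of a DataFrame
// by going through its underlying RDD[Row].
def stratifiedSample(
    sqlContext: SQLContext,
    df: DataFrame,
    keyColumn: String,            // column whose values define the strata
    fractions: Map[Any, Double],  // sampling fraction per key value
    exact: Boolean = false,       // true => sampleByKeyExact (extra passes over the data)
    seed: Long = 42L): DataFrame = {

  val keyIndex = df.schema.fieldNames.indexOf(keyColumn)
  require(keyIndex >= 0, s"Column '$keyColumn' not found in schema")

  // DataFrames have no keys, so key each Row by the chosen column.
  val keyed: RDD[(Any, Row)] = df.rdd.keyBy(row => row(keyIndex))

  // Sample without replacement, stratum by stratum.
  val sampled: RDD[(Any, Row)] =
    if (exact) keyed.sampleByKeyExact(false, fractions, seed)
    else keyed.sampleByKey(false, fractions, seed)

  // Drop the keys again and reattach the original schema.
  sqlContext.createDataFrame(sampled.values, df.schema)
}
```

Usage would look something like stratifiedSample(sqlContext, df, "label", Map[Any, Double]("a" -> 0.1, "b" -> 0.5)), where "label" and the key values stand in for whatever your data actually contains. With exact = false the per-key sample sizes are only approximately the requested fractions.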