This question already has an answer here: Stratified sampling in Spark (2 answers)
I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5. Until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks & Regards, MK
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without MLlib dependencies. These two functions are defined in PairRDDFunctions and apply to key-value RDDs of type RDD[(K, T)]. DataFrames do not have keys, so you have to use the underlying RDD (df.rdd) and key it yourself.
Note that the resulting sample is an RDD, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df.
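A minimal sketch of the approach described above, assuming Spark 1.3.x run in local mode. The column names (key, value), the toy data, and the fractions map are hypothetical and only there for illustration. Every key that occurs in the data needs an entry in the fractions map.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object StratifiedSampleSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stratified-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical toy DataFrame; the first column plays the role of the stratum key.
    val df = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e")))
      .toDF("key", "value")

    // Hypothetical per-key sampling fractions (assumed values, one entry per key).
    val fractions: Map[Int, Double] = Map(1 -> 0.5, 2 -> 0.5, 3 -> 1.0)

    // Drop to the underlying RDD[Row] and key it by the stratum column,
    // which yields an RDD[(Int, Row)] that PairRDDFunctions can work on.
    val keyed = df.rdd.keyBy(row => row.getInt(0))

    // Approximate stratified sample (one pass over the data).
    val sample = keyed.sampleByKey(withReplacement = false, fractions = fractions)

    // Exact per-stratum sample sizes, at the cost of extra passes.
    val sampleExact = keyed.sampleByKeyExact(withReplacement = false, fractions = fractions)

    // Convert back to DataFrames by dropping the keys and reusing df's schema.
    val sampledDf = sqlContext.createDataFrame(sample.values, df.schema)
    val sampledExactDf = sqlContext.createDataFrame(sampleExact.values, df.schema)

    sampledDf.show()
    sampledExactDf.show()
    sc.stop()
  }
}
```

Since the sampling is random, the rows that come back differ from run to run unless you pass a seed as the third argument to sampleByKey or sampleByKeyExact.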