How to do Stratified sampling with Spark DataFrame

2019-05-14 00:27发布

This question already has an answer here:

Stratified sampling in Spark 2 answers

I'm in Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey(), sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5, till that comes through, whats the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames. Thanks & Regards MK

标签： apache-spark apache-spark-mllib

1条回答

在下西门庆

2楼-- · 2019-05-14 00:57

Spark 1.1 added stratified sampling routines SampleByKey and SampleByKeyExact to Spark Core, so since then they are available without MLLib dependencies.

These two functions are PairRDDFunctions and belong to key-value RDD[(K,T)]. Also DataFrames do not have keys. You'd have to use underlying RDD - something like below:

val df = ... // your dataframe
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key

val sample = df.rdd.keyBy(x=>x(0)).sampleByKey(false, fractions)

Note that sample is RDD not DataFrame now, but you can easily convert it back to DataFrame since you already have schema defined for df.

0人赞添加讨论(0) 举报

How to do Stratified sampling with Spark DataFrame

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间