SPARK Is sample method on Dataframes uniform sampl

Posted 2019-02-03 17:51

I want to randomly select a number of rows from a DataFrame. I know the sample method does this, but I need the selection to be uniform. Is Spark's sample method on DataFrames uniform sampling or not?

Thanks

Tags: apache-spark sample spark-dataframe

2 answers

做自己的国王

#2 · 2019-02-03 18:07

There are a few code paths here:

If withReplacement = false && fraction > 0.4, it uses a souped-up random number generator, draws one number per row (rng.nextDouble() <= fraction), and keeps the row when the test passes. Every row gets the same chance of being kept, so this looks uniform.
If withReplacement = false && fraction <= 0.4, it uses a more complex algorithm (GapSamplingIterator). At a glance, that looks uniform as well.
If withReplacement = true, it does close to the same thing, except that by the looks of it a row can be picked more than once, so to me this looks less uniform than the first two.
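
The two without-replacement paths above can be sketched in plain Scala, without Spark, to show why both give each row the same chance of being kept. This is only an illustration under stated assumptions, not Spark's source: the object SamplingSketch and the functions bernoulli and gapSample are hypothetical names, and the gap sampler assumes skip lengths drawn from a geometric distribution.

```scala
import scala.util.Random

// Hypothetical names for illustration; this is not Spark's implementation.
object SamplingSketch {

  // Bernoulli path: one random draw per row; keep the row when the draw
  // is <= fraction. Every row is kept with the same probability.
  def bernoulli[T](rows: IndexedSeq[T], fraction: Double, rng: Random): IndexedSeq[T] =
    rows.filter(_ => rng.nextDouble() <= fraction)

  // Gap-sampling path (assumed form): instead of one draw per row, draw how
  // many rows to skip before the next kept row. The skip length is geometric
  // with success probability `fraction`, which gives the same per-row
  // inclusion probability with far fewer draws when fraction is small.
  def gapSample[T](rows: IndexedSeq[T], fraction: Double, rng: Random): IndexedSeq[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be strictly between 0 and 1")
    val lnQ = math.log1p(-fraction)      // ln(1 - fraction), always negative
    val kept = Vector.newBuilder[T]
    var i = -1L                          // Long index avoids overflow on huge gaps
    var done = false
    while (!done) {
      val u = 1.0 - rng.nextDouble()     // uniform in (0, 1], so log(u) is finite
      val gap = (math.log(u) / lnQ).toLong
      i += gap + 1                       // skip `gap` rows, land on the next kept row
      if (i < rows.length) kept += rows(i.toInt) else done = true
    }
    kept.result()
  }
}
```

For example, SamplingSketch.gapSample((0 until 100000).toVector, 0.1, new Random(2)) should return roughly 10,000 distinct rows in their original order, about the same number that bernoulli returns for the same fraction.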


Ridiculous、

#3 · 2019-02-03 18:09

Yes, it is uniform. For more information, you can try the code below. I hope this clarifies things.

I think this should do the trick, where "data" is your DataFrame:

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))
