SPARK Is sample method on Dataframes uniform sampl

Posted 2019-02-03 17:51

I want to randomly select a number of rows from a DataFrame. I know the sample method does this, but I need the selection to be a uniform random sample. Is the sample method on Spark DataFrames uniform or not?

Thanks

2 answers
做自己的国王
Reply #2 · 2019-02-03 18:07

There are a few code paths here:

  • If withReplacement = false && fraction > .4, it draws one number per row from an optimized random number generator and keeps the row when rng.nextDouble() <= fraction. Every row is kept with the same probability, so this should be uniform.
  • If withReplacement = false && fraction <= .4, it uses a more complex algorithm (GapSamplingIterator), which skips ahead by a random gap instead of testing every row. At a glance, it looks like it should be uniform as well.
  • If withReplacement = true, it does close to the same thing, except that, by the looks of it, a row can be selected more than once. Because of the duplicates, this looks to me like it would not be as uniform as the first two.
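To make the first two code paths concrete, here is a minimal sketch in plain Scala (no Spark needed). It is not Spark's actual implementation: the object and method names are made up for illustration, and the gap sampler is a simplified version of the general idea behind gap sampling (drawing the number of rows to skip from a geometric distribution). The inclusionRates helper estimates how often each position ends up in the sample; for a uniform sampler every position should come out close to the requested fraction.

```scala
import scala.util.Random

// Hypothetical names; a simplified illustration, not Spark's code.
object UniformSamplingSketch {

  // Bernoulli sampling: keep each item independently with probability `fraction`.
  def bernoulli[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() <= fraction)

  // Gap sampling: instead of one random draw per item, draw how many items
  // to skip before the next accepted one (a geometric distribution).
  def gapSample[T](items: IndexedSeq[T], fraction: Double, rng: Random): Seq[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnQ = math.log1p(-fraction)          // ln(1 - fraction), negative
    val out = Vector.newBuilder[T]
    var next = -1.0                          // index of the last accepted item
    var done = false
    while (!done) {
      val u = 1.0 - rng.nextDouble()         // uniform in (0, 1], avoids log(0)
      val gap = math.floor(math.log(u) / lnQ) // number of items skipped
      next = next + 1.0 + gap
      if (next >= items.length) done = true
      else out += items(next.toInt)
    }
    out.result()
  }

  // Fraction of trials in which each of the positions 0 until n was selected.
  def inclusionRates(
      sampler: (IndexedSeq[Int], Random) => Seq[Int],
      n: Int,
      trials: Int,
      seed: Long): Array[Double] = {
    val rng = new Random(seed)
    val hits = new Array[Long](n)
    val items = (0 until n).toVector
    for (_ <- 0 until trials; i <- sampler(items, rng)) hits(i) += 1
    hits.map(_.toDouble / trials)
  }

  def main(args: Array[String]): Unit = {
    val fraction = 0.3
    val b = inclusionRates((xs, r) => bernoulli(xs, fraction, r), 20, 20000, 1L)
    val g = inclusionRates((xs, r) => gapSample(xs, fraction, r), 20, 20000, 2L)
    println(f"bernoulli: min=${b.min}%.3f max=${b.max}%.3f")
    println(f"gap:       min=${g.min}%.3f max=${g.max}%.3f")
  }
}
```

Both samplers should give every position roughly the same inclusion rate. The gap version just needs fewer random draws when the fraction is small, which is presumably why it is used for low fractions.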
Ridiculous、
Reply #3 · 2019-02-03 18:09

Yes, it is uniform. For more information, you can try the code below. I hope this clarifies things.

I think this should do the trick, where "data" is your DataFrame:

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))
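For anyone who wants to try this end to end, below is a rough, self-contained sketch that runs both sample and randomSplit on a local SparkSession. The object name, the generated "id" column, the row count, and the seed are all assumptions made for the example. It assumes Spark SQL is on the classpath. Note that sample takes a fraction rather than a row count, and the size of the result is only approximately fraction × total.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; the data, fraction, and seed are illustrative only.
object SampleVsRandomSplit {

  // Returns (sampled row count, training row count, test row count).
  def run(spark: SparkSession, total: Long, seed: Long): (Long, Long, Long) = {
    // Stand-in for "data": a single-column DataFrame with ids 0 until total.
    val data = spark.range(0, total).toDF("id")

    // Sample roughly 10% of the rows without replacement.
    val sampled = data.sample(withReplacement = false, fraction = 0.1, seed = seed)

    // Split into two disjoint parts with weights 0.7 and 0.3.
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed)

    (sampled.count(), trainingData.count(), testData.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-vs-randomSplit")
      .getOrCreate()
    try {
      val (sampled, training, test) = run(spark, 100000L, 42L)
      println(s"sampled=$sampled training=$training test=$test")
    } finally {
      spark.stop()
    }
  }
}
```

Passing a seed makes the runs repeatable, which helps when comparing counts across experiments.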
