I want to randomly choose a set number of rows from a DataFrame. I know the sample method does this, but I need the sampling to be uniform. Is the sample method on Spark DataFrames uniform or not?
Thanks
There are a few code paths here:
withReplacement = false && fraction > .4: it uses a souped-up random number generator (rng.nextDouble() <= fraction) and lets that do the work. This seems like it would be pretty uniform.
withReplacement = false && fraction <= .4: it uses a more complex algorithm (GapSamplingIterator) that, at a glance, looks like it should also be uniform.
withReplacement = true: it does close to the same thing, except that it can duplicate rows by the looks of it, so this looks to me like it would not be as uniform as the first two.

Yes, it is uniform. For more information you can try the code below. I hope this clarifies.
I think this should do the trick, where "data" is your data frame:

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))
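
To make the first code path described above (withReplacement = false, with the rng.nextDouble() <= fraction check) more concrete, here is a minimal sketch in plain Scala. It is not Spark's source code: the object name BernoulliSketch and the function bernoulliSample are made up for illustration, and it works on an ordinary Seq rather than a DataFrame. The point is only that every row is kept independently with the same probability, which is what makes this kind of sampling uniform.

```scala
import scala.util.Random

// Hypothetical illustration, not Spark's implementation.
object BernoulliSketch {

  // Keep each row independently with probability `fraction`.
  // Every row gets the same chance, so no row is favoured over another.
  def bernoulliSample[T](rows: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() <= fraction)
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 100000
    val sampled = bernoulliSample(rows, fraction = 0.5, seed = 42L)
    // The size is only approximately fraction * rows.size, not exact.
    println(s"kept ${sampled.size} of ${rows.size} rows")
  }
}
```

Note that the number of rows returned by this style of sampling is random and only close to fraction times the input size. In Spark itself the corresponding call would be something like data.sample(withReplacement = false, fraction = 0.3), and it has the same property: you get roughly, not exactly, that fraction of the rows.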