How do simple random sampling and dataframe SAMPLE

Posted 2020-02-01 09:14

Question:

Q1. I am trying to get a simple random sample out of a Spark dataframe (13 rows) using the sample function with parameters withReplacement: false, fraction: 0.6, but it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?

Q2. How is the sample obtained after random number generation?

Thanks in advance

Answer 1:

How is the sample obtained after random number generation?

Depending on the fraction you want to sample, there are two different algorithms. You can check Justin Pihony's answer to "SPARK Is sample method on Dataframes uniform sampling?"

it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?

If fraction is above RandomSampler.defaultMaxGapSamplingFraction, sampling is done by a simple filter:

items.filter { _ => rng.nextDouble() <= fraction }

otherwise, simplifying things a little bit, it repeatedly calls the drop method with a random integer and then takes the next item.
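The drop-and-take idea can be sketched roughly as follows. This is a minimal illustration, not Spark's GapSamplingIterator: the name gapSample, the use of scala.util.Random, and the 1e-10 floor are assumptions made for the example. The skip length is drawn from a geometric distribution, so each item still ends up in the sample with probability fraction.

```scala
import scala.util.Random

// Hypothetical helper for illustration; not Spark's GapSamplingIterator.
def gapSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] = {
  require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
  val lnq = math.log1p(-fraction) // ln(1 - fraction), always negative here

  new Iterator[T] {
    // Drop a random number of items so that the next one is a sampled item.
    private def skip(): Unit = {
      val u = math.max(rng.nextDouble(), 1e-10) // assumed floor, avoids log(0)
      val k = (math.log(u) / lnq).toInt         // geometric gap length
      var i = 0
      while (i < k && items.hasNext) { items.next(); i += 1 }
    }

    skip() // position on the first sampled item

    def hasNext: Boolean = items.hasNext

    def next(): T = {
      val item = items.next()
      skip() // move to the following sampled item
      item
    }
  }
}

val picked = gapSample((1 to 1000).iterator, 0.1, new Random(7L)).toList
println(s"picked ${picked.size} of 1000 items")
```

For small fractions this draws one random number per sampled item rather than one per input item, which is presumably why it is preferred below the threshold.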

Keeping that in mind, it should be clear that the number of returned elements is random, with a mean (assuming there is nothing wrong with GapSamplingIterator) equal to fraction * rdd.count. If you set the seed, you get the same sequence of random numbers and, as a consequence, the same elements are included in the sample.
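To see this effect outside Spark, here is a small plain-Scala sketch that applies the same kind of filter to 13 rows with fraction 0.6, as in the question. The helper name bernoulliSample is hypothetical, and scala.util.Random stands in for Spark's own generator, so the concrete rows selected will not match what Spark returns.

```scala
import scala.util.Random

// Hypothetical helper mirroring the filter shown above; not a Spark API.
def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
  items.filter(_ => rng.nextDouble() <= fraction)

val rows = 1 to 13

// Unseeded generators: the sample size generally differs between draws.
val unseededSizes = Seq.fill(5)(bernoulliSample(rows, 0.6, new Random()).size)
println(s"unseeded sizes: $unseededSizes")

// Same seed => same random sequence => same rows selected.
val a = bernoulliSample(rows, 0.6, new Random(42L))
val b = bernoulliSample(rows, 0.6, new Random(42L))
println(s"seeded samples equal: ${a == b}")

// Averaged over many draws, the size approaches fraction * count = 0.6 * 13 = 7.8.
val trials = 100000
val rng = new Random()
val meanSize = Seq.fill(trials)(bernoulliSample(rows, 0.6, rng).size).sum.toDouble / trials
println(f"mean size over $trials trials: $meanSize%.2f")
```

Each row is an independent coin flip, so the size follows a binomial distribution; with only 13 rows, sizes a few rows away from 7.8 are common.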



Answer 2:

The RDD API includes takeSample, which returns a "sample of specified size in an array". It works by calling sample until it gets a sample at least as large as the requested size, then randomly taking the specified number of elements from it. The code comments note that it should rarely have to iterate, because the initial sampling is biased toward large sample sizes.
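The retry-then-trim approach described here can be sketched in plain Scala as follows. This is only an illustration of the idea, not Spark's takeSample: the name takeSampleSketch and the oversampling formula (1.5 * num / size + 0.1) are assumptions chosen for the example, and it covers only sampling without replacement.

```scala
import scala.util.Random

// Hypothetical sketch of "oversample, retry if too small, then trim"; not Spark's code.
def takeSampleSketch[T](items: Seq[T], num: Int, rng: Random): Seq[T] = {
  require(num >= 0 && num <= items.size, "num must be between 0 and items.size")

  // Assumed padding: sample with a fraction well above num / size so that
  // the first attempt is usually large enough.
  val fraction = math.min(1.0, num.toDouble / items.size * 1.5 + 0.1)

  var sample = items.filter(_ => rng.nextDouble() <= fraction)
  while (sample.size < num) {
    // Rare case: the random sample came out too small, so draw again.
    sample = items.filter(_ => rng.nextDouble() <= fraction)
  }

  // Randomly keep exactly num of the oversampled items.
  rng.shuffle(sample).take(num)
}

val exact = takeSampleSketch(1 to 13, 8, new Random(1L))
println(s"size is always ${exact.size}: $exact")
```

Unlike the plain filter, the result always has exactly num elements, at the cost of materializing the intermediate sample in memory.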