The sample method of Spark RDD does not work as expected

Posted 2019-07-22 05:20

I am trying with the "sample" method of RDD on Spark 1.6.1

scala> val nu = sc.parallelize(1 to 10)
scala> val sp = nu.sample(true, 0.2)
scala> sp.collect.foreach(println(_))

3 8

scala> val sp2 = nu.sample(true, 0.2)
scala> sp2.collect.foreach(println(_))

2 4 7 8 10

I cannot understand why sp2 contains 2, 4, 7, 8, 10. I thought only two numbers should be printed. Is there anything wrong?

3 Answers
爱情/是我丢掉的垃圾 · 2019-07-22 05:36

From the docs of the sample method on RDD:

Return a sampled subset of this RDD.

The size of the returned subset is not documented as exact, so it can be any number of elements drawn from your source RDD.
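
As a quick illustration (a sketch in the Spark shell, not part of the original answer): sample also takes an optional seed, which makes a particular draw repeatable, but it still does not let you pick the size of the result in advance.

// Sketch, assuming the usual Spark shell sc; 42L is an arbitrary seed chosen for illustration
val nu = sc.parallelize(1 to 10)
val seeded = nu.sample(withReplacement = true, fraction = 0.2, seed = 42L)
// repeated runs with the same seed return the same elements,
// but how many elements come back is whatever the sampler produced for that seed
seeded.collect().foreach(println)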

对你真心纯属浪费 · 2019-07-22 05:53

Elaborating on the previous answer: in the documentation (scroll down to sample) it is mentioned (emphasis mine):

fraction: expected size of the sample as a fraction of this RDD's size.
Without replacement: probability that each element is chosen; fraction must be [0, 1].
With replacement: expected number of times each element is chosen; fraction must be >= 0.

'Expected' can have several meanings depending on the context, but one meaning it certainly does not have is 'exact'; hence the varying size of the returned sample.
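
A quick way to see this in the shell (a sketch along the lines of the question's code, not from the original answer) is to repeat the same call and count the results; the size hovers around fraction times the RDD size rather than hitting it exactly.

// Sketch, assuming the Spark shell sc: repeat sample(true, 0.2) on a 10-element RDD
// and print the count of each run; it fluctuates around 10 * 0.2 = 2
val nu = sc.parallelize(1 to 10)
(1 to 5).foreach { i =>
  val n = nu.sample(withReplacement = true, fraction = 0.2).count()
  println(s"run $i: $n elements")
}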

If you want an absolutely fixed sample size, you can use the takeSample method instead; the downside is that it returns an array (i.e. not an RDD), which must fit in your main memory:

val nu = sc.parallelize(1 to 10)
// set seed for reproducibility
val sp1 = nu.takeSample(true, 2, 182453)
sp1: Array[Int] = Array(7, 2)

val sp2 = nu.takeSample(true, 2)
sp2: Array[Int] = Array(2, 10)

val sp3 = nu.takeSample(true, 2)
sp3: Array[Int] = Array(4, 6)
虎瘦雄心在 · 2019-07-22 06:00

The fraction does not mean "give me exactly this number of elements". It means "give me this many elements on average", so you will get a different number of elements each time you run it.
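
To convince yourself of the "on average" part, you can average the sample size over many runs (a rough sketch reusing the question's 10-element RDD, assuming the Spark shell sc); the mean should come out close to fraction times the RDD size, i.e. about 2.

// Sketch: the mean sample size over many runs approaches 10 * 0.2 = 2,
// even though individual runs return 0, 1, 2, 3, ... elements
val nu = sc.parallelize(1 to 10)
val runs = 100
val avgSize = (1 to runs).map(_ => nu.sample(true, 0.2).count()).sum.toDouble / runs
println(f"average size over $runs runs: $avgSize%.2f")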
