I am trying out the "sample" method of RDD on Spark 1.6.1:
scala> val nu = sc.parallelize(1 to 10)
scala> val sp = nu.sample(true, 0.2)
scala> sp.collect.foreach(println(_))
3
8
scala> val sp2 = nu.sample(true, 0.2)
scala> sp2.collect.foreach(println(_))
2
4
7
8
10
I cannot understand why sp2 contains 2,4,7,8,10. I think there should be only two numbers printed. Is there anything wrong?
The sample method on an RDD returns another RDD holding a randomly sampled subset of the elements, so the result could be anything from your master RDD.
Elaborating on the previous answer: in the documentation (scroll down to sample), the fraction parameter is described as the "expected size of the sample as a fraction of this RDD's size" (emphasis on "expected" mine). 'Expected' can have several meanings depending on the context, but one meaning it certainly does not have is 'exact', hence the varying size of the sample.
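To make "expected" concrete, here is a minimal plain-Scala simulation that does not use Spark at all. It assumes that sampling with replacement draws each element an independent Poisson(fraction) number of times, which is consistent with the documented "expected number of times each element is chosen" but is only a sketch of the behaviour, not Spark's actual code. The names ExpectedSampleSize, poisson and simulateSample are made up for this illustration.

```scala
import scala.util.Random

object ExpectedSampleSize {
  // Knuth's method for drawing a Poisson(lambda) count; adequate for small lambda.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // One simulated run of sampling with replacement (assumption: Poisson counts per element).
  def simulateSample(data: Seq[Int], fraction: Double, rng: Random): Seq[Int] =
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val data = 1 to 10
    // Repeat the experiment many times and look at the spread of sample sizes.
    val sizes = Seq.fill(10000)(simulateSample(data, 0.2, rng).size)
    val mean = sizes.sum.toDouble / sizes.size
    println(f"min=${sizes.min}, max=${sizes.max}, mean=$mean%.2f")
  }
}
```

Under that assumption the individual sample sizes range from 0 up to well above 2, while their average over many runs stays close to 10 * 0.2 = 2. A result with five elements, like sp2 in the question, is unusual but perfectly possible.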
If you want absolutely fixed sample sizes, you may use the takeSample method, the downside being that it returns an array (i.e. not an RDD), which must fit in your main memory.

The fraction does not mean "give me exactly this number of elements". It means "give me this number of elements on average", so you will get a different number of elements each time you run it.
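As a rough sketch of the difference between the two methods discussed above, the following small program runs sample several times and then calls takeSample once. It assumes a local Spark installation on the classpath; the object name FixedSizeSample, the app name and the local[2] master are arbitrary choices for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FixedSizeSample {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setAppName("fixed-size-sample").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val nu = sc.parallelize(1 to 10)

      // sample: the size is only 2 on average, so it varies between runs.
      val sizes = (1 to 5).map(_ => nu.sample(withReplacement = true, fraction = 0.2).count())
      println(s"sample sizes over 5 runs: ${sizes.mkString(", ")}")

      // takeSample: returns exactly `num` elements as a local Array, not an RDD.
      val fixed: Array[Int] = nu.takeSample(withReplacement = true, num = 2)
      println(s"takeSample returned ${fixed.length} elements: ${fixed.mkString(", ")}")
    } finally {
      sc.stop()
    }
  }
}
```

The sizes printed for sample will generally differ from run to run, whereas takeSample always yields two elements here. Because the result of takeSample is collected to the driver, it is only suitable when the requested number of elements is small enough to fit in memory.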