How can I take an RDD in Spark and split it randomly into two RDDs, so that each RDD holds a portion of the data (let's say 97% and 3%)?
I thought about shuffling the list and then taking the first part with shuffledList.take((0.97 * rddList.count).toInt).
But how can I shuffle the RDD?
Or is there a better way to split the list?
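For context, a minimal sketch of the manual approach described above, assuming a spark-shell session where sc is the SparkContext and rddList is an illustrative RDD[String]; it "shuffles" by sorting on a random key, and note that take collects to the driver, so this only works when the kept portion fits in driver memory:

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Hypothetical input RDD; replace with your own data.
    val rddList: RDD[String] = sc.parallelize(Seq("a", "b", "c", "d", "e"))

    // "Shuffle" by assigning each element a random sort key.
    val shuffledList = rddList.sortBy(_ => Random.nextDouble())

    // take() returns a local Array on the driver, so this is only suitable
    // for data small enough to fit there.
    val first97 = shuffledList.take((0.97 * rddList.count()).toInt)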
I've found a simple and fast way to split the array: the randomSplit method.
It will split the data using the provided weights.
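For example, assuming the same rddList as above, the 97/3 split from the question becomes:

    // Split rddList into two RDDs holding roughly 97% and 3% of the elements.
    // The weights are normalized internally; the second argument is an optional seed.
    val splits = rddList.randomSplit(Array(0.97, 0.03), 12345L)
    val bigPart   = splits(0)  // ~97% of the data
    val smallPart = splits(1)  // ~3% of the data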
You should use the randomSplit method.
Here is the idea behind its implementation in Spark 1.0:
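Roughly speaking, the weights are normalized into cumulative probability ranges, and each output RDD keeps the elements whose seeded random draw falls into its range, so every element lands in exactly one split. Below is a simplified, self-contained sketch of that idea; the helper name weightedSplit is hypothetical and this is not Spark's verbatim source:

    import scala.reflect.ClassTag
    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Illustrative re-implementation of the idea behind randomSplit.
    // The weights become cumulative ranges (e.g. [0.0, 0.97) and [0.97, 1.0)),
    // and every output RDD replays the same seeded per-partition random
    // sequence, so each element falls into exactly one range.
    def weightedSplit[T: ClassTag](rdd: RDD[T],
                                   weights: Array[Double],
                                   seed: Long): Array[RDD[T]] = {
      val total = weights.sum
      val cumulative = weights.map(_ / total).scanLeft(0.0)(_ + _)
      cumulative.sliding(2).map { case Array(lb, ub) =>
        rdd.mapPartitionsWithIndex({ (partIdx, iter) =>
          val rng = new Random(seed + partIdx)  // identical sequence for every split
          iter.filter { _ =>
            val draw = rng.nextDouble()
            draw >= lb && draw < ub
          }
        }, preservesPartitioning = true)
      }.toArray
    }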