How can I effectively split up an RDD[T] into a Seq[RDD[T]] / Iterable[RDD[T]] with n elements, preserving the original ordering?
I would like to be able to write something like this:
RDD(1, 2, 3, 4, 5, 6, 7, 8, 9).split(3)
which should result in something like:
Seq(RDD(1, 2, 3), RDD(4, 5, 6), RDD(7, 8, 9))
Does Spark provide such a function? If not, what is a performant way to achieve this?
val parts = rdd.count() / n
val rdds = rdd.zipWithIndex()
  .map { case (t, i) => (i - (i % parts), t) }
  .groupByKey().collect().sortBy(_._1)                     // collect the groups on the driver
  .map { case (_, group) => sc.parallelize(group.toSeq) }
  .toSeq
Doesn't look very fast..
You can, technically, do what you're suggesting, but it doesn't make sense in the context of leveraging a computing cluster to perform distributed processing of big data; it goes against the entire point of Spark. If you do a groupByKey and then try to extract each group into a separate RDD, you are effectively pulling all of the data distributed in the RDD onto the driver and then re-distributing each group back to the cluster. If the driver can't hold the entire dataset, it won't be able to perform this operation either.
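That said, if downstream code really does need a Seq[RDD[T]] in the original order, here is a minimal sketch that at least avoids the collect-and-re-parallelize round trip by filtering on a zipWithIndex index (the helper name is mine, not a Spark API). Note that each slice still re-scans the parent RDD, so you pay n passes over the data:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// A sketch, not a recommendation: build n order-preserving slices by filtering
// on the position from zipWithIndex. Nothing is collected to the driver, but
// every slice re-scans the (cached) parent RDD.
def splitByIndex[T: ClassTag](rdd: RDD[T], n: Int): Seq[RDD[T]] = {
  val indexed = rdd.zipWithIndex().cache()      // (element, position), original order
  val total   = indexed.count()
  val size    = (total + n - 1) / n             // elements per slice, rounded up
  (0 until n).map { k =>
    indexed.filter { case (_, i) => i / size == k }.map { case (t, _) => t }
  }
}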
You should not be loading large data files onto the driver node from a local file system in the first place. Move the file onto a distributed filesystem such as HDFS or S3, and then load it onto the cluster as an RDD of lines with val lines = sc.textFile(...) (where sc is your SparkContext). When you do this, each worker loads only a portion of the file, which is possible because the data is already distributed across the cluster by the distributed filesystem.
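For example (a sketch; the app name and HDFS path below are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Each executor reads only the file splits assigned to it; nothing is funnelled
// through the driver.
val sc    = new SparkContext(new SparkConf().setAppName("batch-processing"))
val lines = sc.textFile("hdfs:///data/big-input.txt")   // RDD[String], one element per line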
If you then need to organize the data into "batches" that matter to the functional processing of the data, you can key each line with an appropriate batch identifier, for example: val batches = lines.keyBy(line => lineBatchID(line))
Each batch can then be reduced to a batch-level summary, and these summaries can be reduced into a single overall result.
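A minimal sketch of that pattern, assuming lineBatchID is the placeholder from above and using a simple record count as the per-batch summary:

// Sketch of batch -> per-batch summary -> overall result; lineBatchID is a
// hypothetical batch-identifier function, and the "summary" is just a count.
val batches        = lines.keyBy(line => lineBatchID(line))         // (batchId, line)
val batchSummaries = batches.mapValues(_ => 1L).reduceByKey(_ + _)  // one count per batch
val overallTotal   = batchSummaries.values.reduce(_ + _)            // single overall result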
For the purposes of testing Spark code, it is fine to load a small sample of a data file onto a single machine. But when it comes to the full dataset, you should be leveraging a distributed filesystem in conjunction with a Spark cluster to process the data.