I am new to Apache Spark, and I know that its core data structure is the RDD. I am now writing some apps that need positional information about elements. For example, after converting an ArrayList into a (Java)RDD, I need to know the (global) array subscript of each integer in the RDD. Is this possible?
As far as I know, RDD has a take(int) function, so I believe the positional information is still maintained in the RDD.
Essentially, RDD's zipWithIndex() method seems to do this, but it won't preserve the original ordering of the data the RDD was created from. At least you'll get a stable ordering.
The reason you're unlikely to find something that preserves the order in the original data is buried in the API doc for zipWithIndex():
So it looks like the original order is discarded. If preserving the original order is important to you, it looks like you need to add the index before you create the RDD.
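If you go that route, here is a minimal sketch in plain Python of attaching the index up front. It makes no Spark calls, and the variable names are only for illustration. Each element is paired with its position before the list is handed over:

```python
# Pair every element with its original subscript before any RDD exists.
# Plain Python only; `values` and `indexed_values` are illustrative names.
values = [10, 20, 30, 40]

# (value, index) pairs, the same pair layout that zipWithIndex() produces.
indexed_values = [(value, index) for index, value in enumerate(values)]

print(indexed_values)
```

The list of pairs is what you would then turn into an RDD (for example with parallelize), so each element carries its original subscript no matter how it is partitioned or shuffled later.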
I believe that in most cases zipWithIndex() will do the trick, and that it will preserve the order. Read the API doc comment again: my understanding is that it means exactly that the order in the RDD is kept.
The example above confirms it. The RDD has 3 partitions, and a gets index 0, b gets index 1, and so on.
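To make the ordering rule concrete, here is a small plain-Python model of it: indices are assigned by partition order first, then by position within each partition. This is not Spark code. The helper names are hypothetical, and it assumes the list is split into contiguous, in-order slices, which is an assumption about partitioning rather than something guaranteed here.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_partitions(items: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Hypothetical helper: cut `items` into contiguous, in-order slices."""
    n = len(items)
    return [
        list(items[i * n // num_partitions:(i + 1) * n // num_partitions])
        for i in range(num_partitions)
    ]


def zip_with_index_model(partitions: List[List[T]]) -> List[Tuple[T, int]]:
    """Model of the rule: walk partitions in order, then items within each."""
    result = []
    next_index = 0
    for partition in partitions:
        for item in partition:
            result.append((item, next_index))
            next_index += 1
    return result


partitions = split_into_partitions(["a", "b", "c", "d", "e", "f"], 3)
indexed = zip_with_index_model(partitions)

print(partitions)
print(indexed)
```

Under that assumption the assigned index equals the original list position. If the data had been reordered before indexing (for example by a shuffle), the indices would follow the new partition layout instead.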