RDD has a meaningful (as opposed to some random order imposed by the storage model) order if it was processed by sortBy()
, as explained in this reply.
Now, which operations preserve that order?
E.g., is it guaranteed that (after a.sortBy()
)
a.map(f).zip(a) ===
a.map(x => (f(x),x))
How about
a.filter(f).map(g) ===
a.map(x => (x,g(x))).filter(f(_._1)).map(_._2)
what about
a.filter(f).flatMap(g) ===
a.flatMap(x => g(x).map((x,_))).filter(f(_._1)).map(_._2)
Here "equality" ===
is understood as "functional equivalence", i.e., there is no way to distinguish the outcome using user-level operations (i.e., without reading logs &c).
All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a
sortBy
. For example, if you read a file (sc.textFile
) the lines of the RDD will be in the order that they were in the file.Without trying to give a complete list,
map
,filter
,flatMap
, andcoalesce
(withshuffle=false
) do preserve the order.sortBy
,partitionBy
,join
do not preserve the order.The reason is that most RDD operations work on
Iterator
s inside the partitions. Somap
orfilter
just has no way to mess up the order. You can take a look at the code to see for yourself.You may now ask: What if I have an RDD with a
HashPartitioner
. What happens when I usemap
to change the keys? Well, they will stay in place, and now the RDD is not partitioned by the key. You can usepartitionBy
to restore the partitioning with a shuffle.