I'm trying to find a way to determine the locality of an RDD's partitions in Spark.
After calling RDD.repartition()
or PairRDD.combineByKey()
the returned RDD is partitioned. I'd like to know which worker instances the partitions end up on (to examine the partitioning behaviour).
Can someone give me a clue?
An interesting question that, I'm afraid, has a not-so-interesting answer :)
First of all, applying transformations to your RDD has nothing to do with worker instances, as they are separate "entities". Transformations create an RDD lineage (= a logical plan), while executors come on stage (no pun intended) only after an action is executed (at which point the DAGScheduler transforms the logical plan into a physical execution plan, i.e. a set of stages with tasks).
So, I believe the only way to know which executor a partition is processed on is to use org.apache.spark.SparkEnv to access the BlockManager that corresponds to a single executor. That's exactly how Spark knows/tracks executors (by their BlockManagers).
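As a minimal sketch of that idea: you can run a cheap job over the RDD and, inside each task, read the current executor's id from SparkEnv. Note this triggers an action, so the mapping reflects one actual execution; a later job may schedule partitions differently. Here rdd stands for your repartitioned RDD (an assumption, not a fixed name):

```scala
import org.apache.spark.SparkEnv

// Sketch: for every partition, emit (partition index, executor id it ran on).
// SparkEnv.get inside a task returns the executor-side environment.
val partitionToExecutor = rdd
  .mapPartitionsWithIndex { (idx, _) =>
    Iterator((idx, SparkEnv.get.executorId))
  }
  .collect()

partitionToExecutor.foreach { case (idx, executorId) =>
  println(s"Partition $idx ran on executor $executorId")
}
```

On a local master every partition will report the driver's executor id, so this is only informative on a real cluster.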
You could write an org.apache.spark.scheduler.SparkListener that intercepts onExecutorAdded, onBlockManagerAdded and their *Removed counterparts to learn how executors map to BlockManagers (but I believe SparkEnv alone is enough).
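A sketch of such a listener, assuming you register it yourself via sc.addSparkListener (the class name ExecutorTracker is mine):

```scala
import org.apache.spark.scheduler._

// Sketch: logs executor <-> BlockManager lifecycle events.
// Register with: sc.addSparkListener(new ExecutorTracker)
class ExecutorTracker extends SparkListener {
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    println(s"Executor ${e.executorId} added on host ${e.executorInfo.executorHost}")

  override def onBlockManagerAdded(e: SparkListenerBlockManagerAdded): Unit =
    println(s"BlockManager for executor ${e.blockManagerId.executorId} at ${e.blockManagerId.hostPort}")

  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
    println(s"Executor ${e.executorId} removed: ${e.reason}")

  override def onBlockManagerRemoved(e: SparkListenerBlockManagerRemoved): Unit =
    println(s"BlockManager removed for executor ${e.blockManagerId.executorId}")
}
```

Correlating the executorId fields across these events gives you the executor-to-BlockManager mapping.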