Is there a way to set the preferred locations of RDD partitions manually? I want to make sure a certain partition is computed on a certain machine.
I'm using an array and the parallelize method to create an RDD from it.
Also, I'm not using HDFS; the files are on the local disk. That's why I want to control the execution node.
Yes, there is, but it's RDD-specific and so different kinds of RDDs have different ways to do it.
Spark uses RDD.preferredLocations to get a list of preferred locations to compute each partition/split on (e.g. block locations for an HDFS file).
Note that the method is final, which means that no one can ever override it.
When you look at the source code of RDD.preferredLocations, you will see how an RDD knows its preferred locations. It uses the protected RDD.getPreferredLocations method, which a custom RDD may (but doesn't have to) override to specify placement preferences.
So the question has now "morphed" into another one: which RDDs allow setting their preferred locations? Find yours and see the source code.
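To make the getPreferredLocations idea concrete, here is a minimal sketch of a custom RDD that overrides it. HostPartition and HostPinnedRDD are names I made up for illustration (they are not part of Spark), and the RDD just yields the hostname it was asked to run on, so treat it as a starting point rather than a finished implementation.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type that remembers the host it would like to run on.
class HostPartition(val index: Int, val host: String) extends Partition

// Hypothetical RDD with one partition per hostname in `hosts`.
class HostPinnedRDD(sc: SparkContext, hosts: Seq[String])
    extends RDD[String](sc, Nil) {

  // One partition per requested host.
  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => new HostPartition(i, h) }.toArray[Partition]

  // Placeholder computation: each partition yields the host it asked for.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator(split.asInstanceOf[HostPartition].host)

  // The hook that the final RDD.preferredLocations delegates to.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostPartition].host)
}
```

Keep in mind that these are preferences: the scheduler tries to honour them, but it may run a task elsewhere if the preferred host is not available.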
If you parallelize your local dataset, it is not distributed to begin with (though it can become so), but... why would you want to use Spark for something you can process locally on a single computer/node?
If, however, you insist and really do want to use Spark for local datasets, the RDD behind SparkContext.parallelize is... let's have a look at the source code... ParallelCollectionRDD, which does allow for location preferences.
Let's then rephrase your question to the following (hoping I won't lose any important fact):
To my great surprise (as I didn't know about the feature), there is such an operator, i.e. SparkContext.makeRDD, that... accepts one or more location preferences (hostnames of Spark nodes) for each object.
In other words, rather than using parallelize you have to use makeRDD (which is available in the Spark Core API for Scala, but I'm not sure about Python, which I'm leaving as a home exercise for you :)).
The same reasoning applies to any other RDD operator/transformation that creates some sort of RDD.
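A minimal sketch of what that could look like in Scala follows. The hostnames and file names are made up, the object name MakeRDDExample is mine, and the local[*] master is only there so the snippet can run on a single machine (where the hostnames will not match any real node); on a cluster you would use the hostnames of your own Spark nodes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MakeRDDExample {
  // Each element is paired with the hosts it would prefer to be computed on.
  // Hostnames and file names are hypothetical placeholders.
  val itemsWithHosts: Seq[(String, Seq[String])] = Seq(
    ("file-a.txt", Seq("host1.example.com")),
    ("file-b.txt", Seq("host2.example.com", "host3.example.com"))
  )

  // This makeRDD overload creates one partition per element and records
  // the location preferences given for it.
  def build(sc: SparkContext): RDD[String] = sc.makeRDD(itemsWithHosts)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for a quick single-machine try-out.
    val conf = new SparkConf().setAppName("makeRDD-location-prefs").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = build(sc)
      // Print the preferences Spark recorded for every partition.
      rdd.partitions.foreach { p =>
        println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

As with any placement preference in Spark, this is a hint to the scheduler and not a hard guarantee that the partition is computed on that host.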