I'm trying to understand Apache Spark's internals. I wonder whether Spark uses any mechanism to ensure data locality when reading from an InputFormat or writing to an OutputFormat (or to other formats natively supported by Spark and not derived from MapReduce).
In the first case (reading), my understanding is that, when using an InputFormat, each split is associated with the host (or hosts) that hold its data, so Spark tries to assign tasks to executors on those hosts in order to reduce network transfer as much as possible.
In the case of writing, how would such a mechanism work? I know that, technically, a file in HDFS can be saved locally on any node and replicated to two others (so the network is used for two of the three replicas). But if you consider writing to other systems, such as a NoSQL database (Cassandra, HBase, and others), those systems have their own way of distributing data. Is there a way to tell Spark to partition an RDD so that it optimizes data locality based on the data distribution expected by the output sink (the target NoSQL database, accessed natively or through an OutputFormat)?
I am referring to an environment in which the Spark nodes and the NoSQL nodes live on the same physical machines.
If you run Spark and Cassandra on the same physical machines, you should check out spark-cassandra-connector. It aims to preserve data locality for both reads and writes.
For example, if you load a Cassandra table into an RDD, the connector will try to run the operations on this RDD locally on each node. And when you save the RDD into Cassandra, the connector will likewise try to write the results locally.
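To make the idea of "trying to run operations locally" concrete, here is a minimal toy sketch in plain Python. It is not Spark's scheduler or the connector's code: the function `assign_tasks`, the host names, and the greedy rule are all assumptions made for illustration. Each partition advertises the hosts that hold its data, and a task is placed on one of those hosts when an executor runs there, otherwise on any executor.

```python
def assign_tasks(preferred_hosts, executor_hosts):
    """Toy locality-aware placement (hypothetical, not Spark's real scheduler).

    preferred_hosts: dict mapping partition id -> list of hosts holding its data
    executor_hosts:  list of hosts that run an executor
    Returns a dict mapping partition id -> (chosen host, is_local).
    """
    load = {host: 0 for host in executor_hosts}
    assignment = {}
    for partition, hosts in preferred_hosts.items():
        # Keep only the preferred hosts that actually run an executor.
        local = [h for h in hosts if h in load]
        # Fall back to every executor when no data-local one exists.
        candidates = local or list(load)
        # Pick the least-loaded candidate; break ties by host name.
        chosen = min(candidates, key=lambda h: (load[h], h))
        load[chosen] += 1
        assignment[partition] = (chosen, bool(local))
    return assignment


# Hypothetical cluster: partition 2 lives on a host without an executor.
preferred = {0: ["node1", "node2"], 1: ["node2"], 2: ["node9"]}
executors = ["node1", "node2", "node3"]
placement = assign_tasks(preferred, executors)
```

In this example partitions 0 and 1 are placed on a host that holds their data, while partition 2 has to read over the network because none of its preferred hosts runs an executor.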
This assumes that your data is already balanced across your Cassandra cluster. If your partition key is poorly chosen, you will end up with an unbalanced cluster anyway.
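A rough way to see why the partition key matters is the toy model below. It is only a sketch: `owner_node` hashes the key and takes it modulo the node count, which is an assumption standing in for Cassandra's real token ring, and the key names are made up.

```python
import hashlib
from collections import Counter


def owner_node(partition_key, num_nodes):
    """Toy placement: hash the key onto a node (not Cassandra's token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def rows_per_node(partition_keys, num_nodes):
    """Count how many rows land on each node for the given partition keys."""
    counts = Counter(owner_node(key, num_nodes) for key in partition_keys)
    return [counts.get(node, 0) for node in range(num_nodes)]


# High-cardinality key: 10,000 distinct values spread over 4 nodes.
well_spread = rows_per_node([f"user-{i}" for i in range(10_000)], 4)

# Low-cardinality key: only 2 distinct values, so at most 2 nodes get data.
skewed = rows_per_node([f"country-{i % 2}" for i in range(10_000)], 4)
```

With only two distinct key values, at least two of the four nodes receive no rows at all, so locality-aware reads and writes cannot balance the work either.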
Also be aware of jobs that shuffle data in Spark. For example, if you call reduceByKey on an RDD, data will be streamed across the network regardless of how locally it was read. So always plan these jobs carefully.
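The effect of a shuffle can be illustrated with a small simulation in plain Python. This is a sketch under stated assumptions, not Spark code: `reduce_by_key` and `target_partition` are hypothetical helpers, each list index stands in for a node, and the partitioner is a simple byte sum rather than Spark's hash partitioner. Values are first combined inside each partition, then every key is sent to the partition that owns it.

```python
def target_partition(key, num_partitions):
    """Stable toy partitioner (assumption: byte sum modulo partition count)."""
    return sum(key.encode("utf-8")) % num_partitions


def reduce_by_key(partitions, func, num_partitions):
    """Simulate a reduce-by-key and count records that change partition.

    partitions: list of lists of (key, value) pairs, one list per partition
    Returns (list of result dicts per partition, number of records moved).
    """
    # Step 1: combine values per key inside each source partition.
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Step 2: shuffle each combined record to the partition that owns its key.
    moved = 0
    result = [dict() for _ in range(num_partitions)]
    for source, local in enumerate(combined):
        for key, value in local.items():
            dest = target_partition(key, num_partitions)
            if dest != source:
                moved += 1  # this record would cross the network
            bucket = result[dest]
            bucket[key] = func(bucket[key], value) if key in bucket else value
    return result, moved


data = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
totals, moved = reduce_by_key(data, lambda x, y: x + y, 2)
```

Here two of the four combined records end up on a different partition from where they started, even though every input record was read locally. Combining before the shuffle reduces the volume moved, but it does not remove the transfer.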