I got an RDD of filenames, so an RDD[String]. I get that by parallelizing a list of filenames (of files inside hdfs).
Now I map this rdd and my code opens a hadoop stream using FileSystem.open(path). Then I process it.
When I run my task, I use spark UI/Stages and I see the "Locality Level" = "PROCESS_LOCAL" for all the tasks. I don't think spark could possibly achieve data locality the way I run the task (on a cluster of 4 data nodes), how is that possible?
When FileSystem.open(path)
gets executed in Spark tasks, File
content will be loaded to local variable in same JVM process and prepares
the RDD ( partition(s) ). so the data locality for that RDD is always
PROCESS_LOCAL
-- vanekjar has
already commented the on question
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest:
- PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
- NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
- NO_PREF data is accessed equally quickly from anywhere and has no locality preference
- RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
- ANY data is elsewhere on the network and not in the same rack
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
Data locality is one of the spark's functionality which increases its processing speed.Data locality section can be seen here in spark tuning guide to Data Locality.At start when you write sc.textFile("path") at this point the data locality level will be according to the path you specified but after that spark tries to make locality level to process_local in order to optimize speed of processing by starting process at the place where data is present(locally).