For example, I run the following word count application on the Spark platform:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Assume one worker needs to handle 1 GB of data. Is it possible that this worker starts doing some computation (like flatMap) before it has fetched all 1 GB of data?
Generally speaking, yes, it can, but your question is a bit broad, so I don't know whether you are looking for an answer to a specific case.
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, I mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
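To make the quoted paragraph concrete, here is a minimal sketch of two actions submitted from separate threads of one SparkContext. The object name, app name, local master and sample data are assumptions chosen for illustration; they are not from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical example: names and data are illustrative only.
object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-sketch")
      .setMaster("local[4]") // assumption: local mode, for experimenting
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000000)

    // Each Future body runs on its own thread, and each action inside it
    // (count, reduce) is submitted to the scheduler as a separate job.
    val evenCount = Future { numbers.filter(_ % 2 == 0).count() }
    val total     = Future { numbers.map(_.toLong).reduce(_ + _) }

    println(s"even numbers: ${Await.result(evenCount, 5.minutes)}")
    println(s"sum: ${Await.result(total, 5.minutes)}")

    sc.stop()
  }
}
```

Whether the two jobs actually overlap depends on the free resources and the scheduling mode.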
Sometimes you need to share resources between different users.
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
So how jobs share the cluster usually depends on which scheduler mode you use and what you are using it for.
Ref. Official documentation > Job Scheduling > Scheduling Within an Application.
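The same documentation page also describes a fair scheduling mode as the alternative to the default FIFO behaviour. The fragment below is a sketch of how it could be switched on; `spark.scheduler.mode` and `spark.scheduler.pool` are the property names given in that page, while the object name, app name, local master and the pool name `interactive` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: only the two property names come from the Spark docs.
object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-scheduling-sketch")
      .setMaster("local[4]")               // assumption: local mode, for experimenting
      .set("spark.scheduler.mode", "FAIR") // the default is FIFO
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the named pool.
    // "interactive" is an illustrative pool name, not a predefined one.
    sc.setLocalProperty("spark.scheduler.pool", "interactive")

    val lineCount = sc.parallelize(Seq("a b", "b c")).count()
    println(s"lines: $lineCount")

    sc.stop()
  }
}
```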
Spark evaluates RDD operations lazily (in other words, only when a result is requested), so no data is read or processed until you invoke an action such as saveAsTextFile.
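As an analogy for why a worker does not need the whole input before it starts computing, here is a plain Scala sketch with no Spark dependency. It assumes that a task pulls its records through the chained transformations via iterators, one record at a time; `LazyPipelineSketch`, `wordCounts` and the `log` callback are hypothetical names used only for this illustration.

```scala
// Plain-Scala analogy; none of these names are part of the Spark API.
object LazyPipelineSketch {
  // `lines` stands in for the records of one partition, and `log` records
  // when each step touches a line.
  def wordCounts(lines: Iterator[String], log: String => Unit): Map[String, Int] = {
    // Stand-in for reading a record from storage.
    val source = lines.map { line => log(s"read: $line"); line }

    // Iterator transformations are lazy: nothing has been read or logged yet.
    val pairs = source
      .flatMap { line => log(s"flatMap: $line"); line.split(" ").iterator }
      .map(word => (word, 1))

    // The fold plays the role of the action: it pulls records through the
    // whole chain one at a time.
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }
  }

  def main(args: Array[String]): Unit = {
    val counts = wordCounts(Iterator("a b", "b c"), println)
    println(counts)
  }
}
```

When this runs, the "read" and "flatMap" messages alternate line by line: each line is split before the next one is read. In the word count from the question, the step that has to wait for data from other tasks is reduceByKey, because it needs a shuffle.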