What happens when Spark master fails?

Published 2019-02-12 15:18

Question:

Does the driver need constant access to the master node? Or is it only required to get the initial resource allocation? What happens if the master is not available after the Spark context has been created? Does it mean the application will fail?

Answer 1:

The first, and probably the most serious, consequence of a master failure or a network partition is that your cluster won't be able to accept new applications. This is why the master is considered a single point of failure when the cluster is used with its default configuration.

Running applications will acknowledge the master loss, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:

  • the application won't be able to finish gracefully
  • if the master is down, or a network partition affects the worker nodes as well, workers will try to reregisterWithMaster. If this fails multiple times, they will simply give up. At that point long-running applications (like streaming apps) won't be able to continue processing, but this still shouldn't result in immediate failure. Instead, the application will wait for the master to come back online (filesystem recovery) or for contact from a new leader (ZooKeeper mode), and if that happens it will continue processing; see the configuration sketch after this list.
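For completeness, a minimal sketch of an application written for the ZooKeeper recovery mode mentioned above. All host names (master1, master2, zk1, zk2) are placeholders, and the spark-env.sh lines in the comment are assumptions about your deployment:

    import org.apache.spark.{SparkConf, SparkContext}

    // The masters themselves must be started with recovery enabled, e.g. in
    // spark-env.sh (placeholder ZooKeeper ensemble):
    //   export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
    //     -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
    object HaApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("ha-example")
          // List every master; the driver registers with the current leader
          // and is notified by the new leader after a failover.
          .setMaster("spark://master1:7077,master2:7077")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).sum())
        sc.stop()
      }
    }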


Answer 2:

Below are the steps a Spark application goes through when it starts:

  1. The Spark driver starts.
  2. The driver connects to the Spark master for resource allocation.
  3. The driver sends the jar attached to the Spark context to the master server.
  4. The driver keeps polling the master server for job status.
  5. If the code uses a broadcast variable, its data is distributed via the driver, and the driver also tracks shuffle metadata. That is why the driver needs sufficient memory.
  6. If there is an operation like take, takeOrdered, or collect, the data is accumulated on the driver (see the sketch after this list).
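A minimal sketch of steps 2 and 6, assuming a standalone master is reachable at the placeholder address master:7077. The driver registers with the master to obtain executors, and collect-style actions pull results back into driver memory:

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverSketch {
      def main(args: Array[String]): Unit = {
        // Step 2: on construction, the SparkContext registers with the master
        // and is granted executors.
        val conf = new SparkConf()
          .setAppName("driver-sketch")
          .setMaster("spark://master:7077") // placeholder master address
        val sc = new SparkContext(conf)

        val rdd = sc.parallelize(1 to 1000000)

        // Step 6: take(n) ships n elements back to the driver; collect() would
        // ship the whole dataset, so the result must fit in driver memory.
        println(rdd.take(10).mkString(", "))

        sc.stop()
      }
    }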

So yes, a failed master leaves the executors unable to communicate with it, so they will stop working. A failed master also leaves the driver unable to ask it for job status. So your application will fail.



Answer 3:

Yes, the driver and master communicate constantly throughout the SparkContext's lifetime. That allows the driver to:

  • Display detailed status of jobs / stages / tasks on its Web Interface and REST API
  • Listen for job start and end events (you can add your own listeners; see the sketch after this list)
  • Wait for jobs to end (via the synchronous API - e.g. rdd.count() won't return until the job is completed) and get their results
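As an illustration of the listener hook mentioned above, a minimal sketch of a custom SparkListener (the class name JobLogger is just a placeholder):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
    import org.apache.spark.{SparkConf, SparkContext}

    // Logs job boundaries via the onJobStart/onJobEnd callbacks.
    class JobLogger extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
    }

    object ListenerDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("listener-demo").setMaster("local[*]"))
        sc.addSparkListener(new JobLogger) // register before running jobs

        // count() is synchronous: it blocks until the job completes.
        println(sc.parallelize(1 to 100).count())
        sc.stop()
      }
    }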

A disconnect between the driver and the master will fail the job.