In Spark's client mode, the driver needs netwo

2019-04-10 06:57发布

问题:

When using spark at client mode (e.g. yarn-client), does the local machine that runs the driver communicates directly with the cluster worker nodes that run the remote executors?

If yes, does it mean the machine (that runs the driver) need to have network access to the worker nodes? So the master node requests resources from the cluster, and returns the IP addresses/ports of the worker nodes to the driver, so the driver can initiating the communication with the worker nodes?

If not, how does the client mode actually work?

If yes, does it mean that the client mode won't work if the cluster is configured in a way that the work nodes are not visible outside the cluster, and one will have to use cluster mode?

Thanks!

回答1:

The Driver connects to the Spark Master, requests a context, and then the Spark Master passes the Spark Workers the details of the Driver to communicate and get instructions on what to do.

The means that the driver node must be available on the network to the workers, and it's IP must be one that's visible to them (i.e. if the driver is behind NAT, while the workers are in a different network, it won't work and you'll see errors on the workers that they fail to connect to the driver)



回答2:

When you run Spark in client mode, the driver process runs locally. In cluster mode, it runs remotely on an ApplicationMaster.

In other words you will need all the nodes to see each other. Spark driver definitely needs to communicate with all the worker nodes. If this is a problem try to use the yarn-cluster mode, then the driver will run inside your cluster on one of the nodes.