I'm trying to host a Spark standalone cluster locally. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below runs in Docker. I have the following configuration:
- master on machine 1 (port 7077 exposed)
- worker on machine 1
- driver on machine 2
I use a test application that opens a file and counts its lines.
The application works when the file is replicated on all workers and I use SparkContext.readText().
But when the file is only present on the worker and I use SparkContext.parallelize() to access it on the workers, I get the following output:
```
INFO StandaloneSchedulerBackend: Granted executor ID app-20180116210619-0007/4 on hostPort 172.17.0.3:6598 with 4 cores, 1024.0 MB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now EXITED (Command exited with code 1)
INFO StandaloneSchedulerBackend: Executor app-20180116210619-0007/4 removed: Command exited with code 1
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180116210619-0007/5 on worker-20180116205132-172.17.0.3-6598 (172.17.0.3:6598) with 4 cores
```
This goes on and on without ever actually running the application.
It works when I put the driver on the same PC as the worker, so I guess some kind of connection between the two has to be permitted across the network. Are you aware of a way to do that (which ports to open, which address to add in /etc/hosts, ...)?
TL;DR Make sure that `spark.driver.host:spark.driver.port` can be accessed from each node in the cluster.
In general you have to ensure that all nodes (both executors and master) can reach the driver.
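One way to verify the point above is to attempt a plain TCP connection to the driver from each node (or from inside each worker container). The following is a minimal sketch using only the Python standard library; the function name `can_reach` is hypothetical, and the host and port you pass in are whatever you configured for `spark.driver.host` and `spark.driver.port`.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running something like `can_reach("192.168.1.20", 40000)` on the worker machine while the application is running (both values are placeholders, not taken from the question) should return `True`. If it returns `False`, the executors most likely cannot connect back to the driver either, which would be consistent with them exiting immediately.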
`spark.driver.host` has to resolve to a publicly reachable address.
Also keep in mind that by default the driver runs on a random port. It is possible to use a fixed one by setting `spark.driver.port`. Obviously this doesn't work that well if you want to submit multiple applications at the same time.
Furthermore, having the file present only on a worker won't work. All inputs have to be accessible from the driver as well as from each executor node.
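As an illustration of the driver address and port settings discussed above, here is a rough sketch that assembles a `spark-submit` command line with a fixed driver port. The helper name `spark_submit_args` and all addresses and port numbers are hypothetical. `spark.driver.bindAddress` and `spark.driver.blockManager.port` are not mentioned in the answer; they are included on the assumption that the driver runs in a Docker container with bridged networking, where the advertised address differs from the one the process binds to and the chosen ports have to be published by the container.

```python
from typing import List


def spark_submit_args(master_url: str, driver_host: str, driver_port: int,
                      block_manager_port: int, app: str) -> List[str]:
    """Build a spark-submit argument list that pins the driver's address and ports."""
    conf = {
        # Address the master and executors use to reach the driver.
        "spark.driver.host": driver_host,
        # Fixed port instead of the default random one.
        "spark.driver.port": str(driver_port),
        # Assumption: inside a container, bind to all interfaces.
        "spark.driver.bindAddress": "0.0.0.0",
        # Assumption: also pin the driver's block manager port so it can be published.
        "spark.driver.blockManager.port": str(block_manager_port),
    }
    args = ["spark-submit", "--master", master_url]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args
```

For example, `spark_submit_args("spark://192.168.1.10:7077", "192.168.1.20", 40000, 40001, "app.py")` yields a list that can be passed to `subprocess.run`. Because the ports are fixed, each concurrently submitted application would need its own pair, which is the limitation noted above.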