So far I have run Spark only on Linux machines and VMs (bridged networking) but now I am interesting on utilizing more computers as slaves. It would be handy to distribute a Spark Slave Docker container on computers and having them automatically connecting themselves to a hard-coded Spark master ip. This short of works already but I am having trouble configuring the right SPARK_LOCAL_IP (or --host parameter for start-slave.sh) on slave containers.
I think I correctly configured the SPARK_PUBLIC_DNS env variable to match the host machine's network-accessible ip (from 10.0.x.x address space), at least it is shown on Spark master web UI and accessible by all machines.
I have also set SPARK_WORKER_OPTS and Docker port forwards as instructed at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html, but in my case the Spark master is running on an other machine and not inside Docker. I am launching Spark jobs from an other machine within the network, possibly also running a slave itself.
Things that I've tried:
- Not configure SPARK_LOCAL_IP at all, slave binds to container's ip (like 172.17.0.45), cannot be connected to from master or driver, computation still works most of the time but not always
- Bind to 0.0.0.0, slaves talk to master and establish some connection but it dies, an other slave shows up and goes away, they continue looping like this
- Bind to host ip, start fails as that ip is not visible within the container but would be reachable by others as port-forwarding is configured
I wonder why isn't the configured SPARK_PUBLIC_DNS being used when connecting to slaves? I thought SPARK_LOCAL_IP would only affect on local binding but not being revealed to external computers.
At https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html they instruct to "set SPARK_LOCAL_IP to a cluster-addressable hostname for the driver, master, and worker processes", is this the only option? I would avoid the extra DNS configuration and just use ips to configure traffic between computers. Or is there an easy way to achieve this?
Edit:
To summarize the current set-up:
- Master is running on Linux (VM at VirtualBox on Windows with bridged networking)
- Driver submits jobs from an other Windows machine, works great
- Docker image for starting up slaves is distributed as a "saved" .tar.gz file, loaded (curl xyz | gunzip | docker load) and started on other machines within the network, has this probem with private/public ip configuration
I'm running 3 different types of docker containers on my machine with the intention of deploying them into the cloud when all the software we need are added to them: Master, Worker and Jupyter notebook (with Scala, R and Python kernels).
Here are my observations so far:
Master:
- I couldn't make it bind to the Docker Host IP. Instead, I pass in a made up domain name to it:
-h "dockerhost-master" -e SPARK_MASTER_IP="dockerhost-master"
. I couldn't find a way to make Akka bind against the container's IP and but accept messages against the host IP. I know it's possible with Akka 2.4, but maybe not with Spark.
- I'm passing in
-e SPARK_LOCAL_IP="${HOST_IP}"
which causes the Web UI to bind against that address instead of the container's IP, but the Web UI works all right either way.
Worker:
- I gave the worker container a different hostname and pass it as
--host
to the Spark org.apache.spark.deploy.master.Worker
class. It can't be the same as the master's or the Akka cluster will not work: -h "dockerhost-worker"
- I'm using Docker's
add-host
so the container is able to resolve the hostname to the master's IP: --add-host dockerhost-master:${HOST_IP}
- The master URL that needs to be passed is
spark://dockerhost-master:7077
Jupyter:
- This one needs the master URL and
add-host
to be able to resolve it
- The
SparkContext
lives in the notebook and that's where the web UI of the Spark Application is started, not the master. By default it binds to the internal IP address of the Docker container. To change that I had to pass in: -e SPARK_PUBLIC_DNS="${VM_IP}" -p 4040:4040
. Subsequent applications from the notebook would be on 4041, 4042, etc.
With these settings the three components are able to communicate with each other. I'm using custom startup scripts with spark-class
to launch the classes in the foreground and keep the Docker containers from quitting at the moment.
There are a few other ports that could be exposed such as the history server which I haven't encountered yet. Using --net host
seems much simpler.
I think I found a solution for my use-case (one Spark container / host OS):
- Use
--net host
with docker run
=> host's eth0 is visible in the container
- Set
SPARK_PUBLIC_DNS
and SPARK_LOCAL_IP
to host's ip, ignore the docker0's 172.x.x.x address
Spark can bind to the host's ip and other machines communicate to it as well, port forwarding takes care of the rest. DNS or any complex configs were not needed, I haven't thoroughly tested this but so far so good.
Edit: Note that these instructions are for Spark 1.x, at Spark 2.x only SPARK_PUBLIC_DNS
is required, I think SPARK_LOCAL_IP
is deprecated.
I am also running spark in containers on different docker hosts. Starting the worker container with these arguments worked for me:
docker run \
-e SPARK_WORKER_PORT=6066 \
-p 6066:6066 \
-p 8081:8081 \
--hostname $PUBLIC_HOSTNAME \
-e SPARK_LOCAL_HOSTNAME=$PUBLIC_HOSTNAME \
-e SPARK_IDENT_STRING=$PUBLIC_HOSTNAME \
-e SPARK_PUBLIC_DNS=$PUBLIC_IP \
spark ...
where $PUBLIC_HOSTNAME
is a hostname reachable from the master.
The missing piece was SPARK_LOCAL_HOSTNAME
, an undocumented option AFAICT.
https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L904