I have created a Spark cluster on OpenStack, running Ubuntu 14.04 with 8 GB of RAM. I created two virtual machines with 3 GB each (keeping 2 GB for the parent OS). Further, I run a master and 2 workers on the first virtual machine and 3 workers on the second machine.
The spark-env.sh file has a basic configuration:
export SPARK_MASTER_IP=10.0.0.30
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_CORES=1
Whenever I deploy the cluster with start-all.sh, I get "failed to launch org.apache.spark.deploy.worker.Worker" and sometimes "failed to launch org.apache.spark.deploy.master.Master". When I check the log file for the error, I see the following:
Spark Command: /usr/lib/jvm/java-7-openjdk-amd64/bin/java -cp /home/ubuntu/spark-1.5.1/sbin/../conf/:/home/ubuntu/spark-1.5.1/assembly/target/scala-2.10/spark-assembly-1.5.1-hadoop2.2.0.jar:/home/ubuntu/spark-1.5.1/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/home/ubuntu/spark-1.5.1/lib_managed/jars/datanucleus-core-3.2.10.jar:/home/ubuntu/spark-1.5.1/lib_managed/jars/datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 10.0.0.30 --port 7077 --webui-port 8080
Even though I get the failure message, the master or worker does come alive after a few seconds.
Can someone please explain the reason?
The Spark configuration system is a mess of environment variables, argument flags, and Java Properties files. I just spent a couple of hours tracking down the same warning and unraveling the Spark initialization procedure, and here's what I found:
1. `sbin/start-all.sh` calls `sbin/start-master.sh` (and then `sbin/start-slaves.sh`)
2. `sbin/start-master.sh` calls `sbin/spark-daemon.sh start org.apache.spark.deploy.master.Master ...`
3. `sbin/spark-daemon.sh start ...` forks off a call to `bin/spark-class org.apache.spark.deploy.master.Master ...`, captures the resulting process id (pid), sleeps for 2 seconds, and then checks whether that pid's command's name is "java"
4. `bin/spark-class` is a bash script, so it starts out with the command name "bash", and proceeds to:
   1. (re-)load the Spark environment via `bin/load-spark-env.sh`
   2. find the `java` executable
   3. find the Spark assembly jar
   4. call `java ... org.apache.spark.launcher.Main ...` to get the full classpath needed for a Spark deployment
   5. finally hand control over, via `exec`, to `java ... org.apache.spark.deploy.master.Master`, at which point the command name becomes "java"

If steps 4.1 through 4.5 take longer than 2 seconds, which in my (and your) experience seems pretty much inevitable on a fresh OS where `java` has never been previously run, you'll get the "failed to launch" message, despite nothing actually having failed.

The slaves will complain for the same reason, and thrash around until the master is actually available, but they should keep retrying until they successfully connect to the master.
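For reference, the part of `sbin/spark-daemon.sh` that prints the message looks roughly like this (paraphrased from the 1.5 branch, not verbatim):

```
# paraphrase of the launch check in sbin/spark-daemon.sh (Spark 1.5.x), not verbatim
nohup nice -n "$SPARK_NICENESS" "$SPARK_PREFIX"/bin/spark-class $command "$@" \
  >> "$log" 2>&1 < /dev/null &
newpid=$!
echo "$newpid" > "$pid"
sleep 2
# After 2 seconds the forked process should already be the exec'd JVM;
# if its command name isn't "java" yet, the script assumes the launch failed.
if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
  echo "failed to launch $command:"
  tail -2 "$log" | sed 's/^/  /'
  echo "full log in $log"
fi
```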
I've got a pretty standard Spark deployment running on EC2; I use the following (examples below):

- `conf/spark-defaults.conf` to set `spark.executor.memory` and add some custom jars via `spark.{driver,executor}.extraClassPath`
- `conf/spark-env.sh` to set `SPARK_WORKER_CORES=$(($(nproc) * 2))`
- `conf/slaves` to list my slaves
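Concretely, those three files end up looking something like this (the memory value, jar path, and hostnames are just placeholders):

```
# conf/spark-defaults.conf  (values and jar path are placeholders)
spark.executor.memory           2g
spark.driver.extraClassPath     /opt/extra-jars/my-lib.jar
spark.executor.extraClassPath   /opt/extra-jars/my-lib.jar

# conf/spark-env.sh
export SPARK_WORKER_CORES=$(($(nproc) * 2))

# conf/slaves  (one worker hostname per line)
worker-1.example.com
worker-2.example.com
```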
Here's how I start a Spark deployment, bypassing some of the `{bin,sbin}/*.sh` minefield/maze.
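A sketch of it, assuming `SPARK_HOME` points at the same path on every node and all the default ports (the log path is just an example):

```
# Start the master directly through bin/spark-class, skipping
# spark-daemon.sh's 2-second "is it java yet?" check.
mkdir -p "$SPARK_HOME"/logs
nohup "$SPARK_HOME"/bin/spark-class org.apache.spark.deploy.master.Master \
  --ip "$(hostname)" --port 7077 --webui-port 8080 \
  >> "$SPARK_HOME"/logs/master.out 2>&1 < /dev/null &
```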
I'm still using `sbin/spark-daemon.sh` to start the slaves, since that's easier than calling `nohup` within the `ssh` command.
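A sketch of that part (the master URL is a placeholder; `spark-daemon.sh`'s arguments are the class to launch, an instance number, and then the Worker's own arguments):

```
# Launch one Worker per host listed in conf/slaves; spark-daemon.sh handles the
# nohup/logging on the remote side. spark://master-host:7077 is a placeholder.
# ssh -n stops ssh from swallowing the rest of the slaves file on stdin.
while read -r slave; do
  ssh -n "$slave" "$SPARK_HOME/sbin/spark-daemon.sh start org.apache.spark.deploy.worker.Worker 1 spark://master-host:7077"
done < "$SPARK_HOME"/conf/slaves
```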
There! It assumes that I'm using all the default ports and stuff, and that I'm not doing stupid shit like putting whitespace in filenames, but I think it's cleaner this way.