I'm using spark-ec2 to run some Spark code. When I set the master to
"local", it runs fine. However, when I set the master to $MASTER,
the workers immediately fail with java.lang.NoClassDefFoundError for
my classes. The workers connect to the master, show up in the UI, and try to run the task, but they raise that exception as soon as the first dependency class (which is in the assembly jar) is loaded.
I've used sbt-assembly to make a jar with the classes, confirmed using
jar tvf that the classes are there, and set SparkConf to distribute
the classes. The Spark web UI indeed shows the assembly jar as
added to the classpath:
http://172.x.x.x:47441/jars/myjar-assembly-1.0.jar
It seems that, even though myjar-assembly contains the
classes and is being added to the cluster, it's not reaching the
workers. How do I fix this? (Do I need to manually copy the jar file?
If so, to which dir? I thought the point of adding jars via SparkConf
was to do this automatically.)
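To make this concrete, my setup looks roughly like the following sketch; the app name and the jar path are placeholders rather than the exact values from my project:

import org.apache.spark.{SparkConf, SparkContext}

// Assembly jar built with sbt-assembly; path is illustrative
val assemblyJar = "/root/myjar-assembly-1.0.jar"

val conf = new SparkConf()
  .setAppName("MyApp")                // illustrative name
  .setMaster(sys.env("MASTER"))       // the $MASTER URL from spark-ec2
  .setJars(Seq(assemblyJar))          // ask Spark to ship the assembly to executors

val sc = new SparkContext(conf)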
My attempts at debugging have shown:
- The assembly jar is being copied to /root/spark/work/app-xxxxxx/1/
(Determined by ssh-ing into a worker and searching for the jar)
- However, that path doesn't appear on the worker's classpath
(Determined from the logs, which show the java -cp command but not that file)
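(In case it helps anyone reproducing this, one more check that can be run from the driver is to print the spark.jars setting, which is what setJars populates, to confirm the jar is registered at all:)

// Sanity check from the driver: which jars has Spark been asked to distribute?
println(sc.getConf.get("spark.jars", "<none>"))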
So it seems I need to tell Spark to add the path to the assembly
jar to the workers' classpath. How do I do that? Or is something else the culprit? (I've spent hours trying to debug this, to no avail!)
NOTE: This is an EC2-specific answer, not a general Spark answer. I'm just trying to round out an answer to a question asked a year ago, one whose symptom shows up with many different causes and trips up a lot of people.
If I understand the question correctly, you are asking, "Do I need to manually copy the jar file? If so, to which dir?" You say you "set SparkConf to distribute the classes", but it isn't clear whether that was done via spark-env.sh or spark-defaults.conf. So I'm making some assumptions, the main one being that you are running in cluster mode, meaning your driver runs on one of the workers and you don't know which one in advance... then...
The answer is yes, to the dir named in the classpath. In EC2 the only persistent data storage is /root/persistent-hdfs, but I don't know if that's a good idea.
In the Spark docs on EC2 I see this line:
To deploy code or data within your cluster, you can log in and use
the provided script ~/spark-ec2/copy-dir, which, given a directory
path, RSYNCs it to the same location on all the slaves.
SPARK_CLASSPATH
I wouldn't use SPARK_CLASSPATH because it's deprecated as of Spark 1.0, so it's better to use its replacement in $SPARK_HOME/conf/spark-defaults.conf:
spark.executor.extraClassPath /path/to/jar/on/worker
This should be the option that works. If you need to do this on the fly, rather than in a conf file, the recommendation is "./spark-submit with --driver-class-path to augment the driver classpath" (from the Spark docs on spark.executor.extraClassPath; see the end of this answer for another source on that).
BUT... you are not using spark-submit, and I don't know how that works on EC2; looking at the script, I couldn't figure out where EC2 lets you supply these parameters on the command line. You mention you already do this when setting up your SparkConf object, so stick with that if it works for you.
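For completeness, since you are already building a SparkConf in code: the conf-file line above also has a programmatic form. This is only a sketch, and it assumes the assembly jar has already been copied to the same path on every worker (the /root/myjars path is made up, e.g. something you'd sync out with copy-dir):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyApp") // illustrative
  // Path as it exists on each worker; the jar must already be present on every node
  .set("spark.executor.extraClassPath", "/root/myjars/myjar-assembly-1.0.jar")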
In Spark-years this is a very old question, so I wonder how you resolved it. I hope this helps someone; I learned a lot researching the specifics of EC2.
I must admit, as a limitation on this, it confuses me that the Spark docs say this about spark.executor.extraClassPath:
Users typically should not need to set this option
I assume they mean most people will get the classpath out through a driver config option. I know most of the spark-submit docs make it sound like the script handles moving your code around the cluster, but I think that's only in "standalone client mode", which I assume you are not using; I assume EC2 must be in "standalone cluster mode."
MORE BACKGROUND ON SPARK_CLASSPATH DEPRECATION:
More background that leads me to think SPARK_CLASSPATH is deprecated: this archived thread, and this one (which crosses the other thread), and this one about a WARN message emitted when SPARK_CLASSPATH is used:
14/07/09 13:37:36 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:
/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
You are required to register the jar with the Spark cluster when submitting your app; to make that possible you can edit your code as follows.
import org.apache.spark.SparkConf

// sparkMasterUrl and sparkHome are assumed to be defined elsewhere in your app
val jars = Seq("/usr/local/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar")

val conf: SparkConf = new SparkConf()
  .setAppName("Busigence App")
  .setMaster(sparkMasterUrl)
  .setSparkHome(sparkHome)
  .setJars(jars)
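For the situation in the question above, the jar you register with setJars would presumably be your own assembly jar rather than the Spark assembly, along these lines (the path is illustrative and must exist on the driver machine):

val jars = Seq("/root/myjar-assembly-1.0.jar") // your sbt-assembly output, as a path on the driver

setJars tells Spark which driver-local jars to serve out to the executors when the application runs.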