Spark spark-submit --jars arguments wants comma li

Posted 2019-01-06 23:33


In "Submitting Applications" in the Spark docs, as of 1.6.0 and earlier, it's not clear how to specify the --jars argument, as it's apparently neither a colon-separated classpath nor a directory expansion.

The docs say "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes."

Question: What are all the options for submitting a classpath with --jars in the spark-submit script in $SPARK_HOME/bin? Anything undocumented that could be submitted as an improvement for docs?

I ask because when I was testing --jars today, we had to explicitly provide a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData --jars=local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar /usr/local/spark/jars/thold-0.0.1-1.jar

We chose to pre-populate the cluster with all the jars in /usr/local/spark/jars on each worker. It seemed that if no local:/, file:/ or hdfs: prefix is supplied, the default is file:/, and the driver makes the jars available on a webserver that it runs. I chose local:, as above.

It also seems that we do not need to put the main jar in the --jars argument. I have not yet tested whether other classes in the final argument (the application-jar argument per the docs, i.e. /usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or whether I need to add the application jar to the --jars list so that classes other than the one named by --class can be seen.

(And granted, with Spark standalone mode using --deploy-mode cluster, you also have to put a copy of the driver on each worker, because you don't know up front which worker will run the driver.)


It worked easily this way, instead of specifying each versioned jar separately:

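# build all other dependent jars in OTHER_JARS

JARS=`find ../lib -name '*.jar'`
OTHER_JARS=""
for eachjarinlib in $JARS ; do
    # prepend each jar, comma-separated, without leaving a trailing comma
    OTHER_JARS="$eachjarinlib${OTHER_JARS:+,$OTHER_JARS}"
done
echo "---final list of jars are : $OTHER_JARS"

spark-submit --verbose --class <yourclass> --jars "$OTHER_JARS" <application-jar>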
  • The Unix tr command can also help, as in the example below.

    --jars $(echo /dir/of/jars/*.jar | tr ' ' ',')
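
As a variation on the tr one-liner, a minimal sketch that also adds the local: prefix used in the question might look like the following. The function name jars_with_scheme is hypothetical (it is not part of Spark), and the sketch assumes every jar sits directly in a single directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of Spark: print every *.jar found directly in
# directory $1 as one comma-separated list, prefixing each entry with the
# URL scheme given in $2 (for example "local:").
jars_with_scheme() {
    dir=$1
    scheme=$2
    list=""
    for jar in "$dir"/*.jar; do
        [ -e "$jar" ] || continue              # the glob matched nothing
        list="${list:+$list,}${scheme}${jar}"  # add a comma only between entries
    done
    printf '%s\n' "$list"
}

# Possible usage with the paths from the question:
# spark-submit --class jpsgcs.thold.PipeLinkageData \
#     --jars "$(jars_with_scheme /usr/local/spark/jars local:)" \
#     /usr/local/spark/jars/thold-0.0.1-1.jar
```

Unlike the tr version, this does not turn spaces inside a file name into commas, and an empty directory yields an empty list instead of the literal *.jar pattern. If the application jar lives in the same directory, it ends up in the list as well.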


One way (the only way?) to use the --jars argument is to supply a comma-separated list of explicitly named jars. I only figured out that commas were needed from a StackOverflow answer that led me to look beyond the docs to the command line:

spark-submit --help 

The output from that command contains:

 --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths. 
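
Because the help text asks for an explicit comma-separated list, a small pre-flight check can catch a mistyped entry before submitting. This is only a sketch: missing_jars is a hypothetical helper (not part of Spark), it checks only the machine where it runs, and it assumes local: and file: entries are written in the single-slash form used in this thread, such as local:/path/to.jar.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of Spark: split a comma-separated
# --jars value ($1) and print every entry whose file is missing on this machine.
# Plain paths and local:/file: entries are checked; entries with any other
# scheme (for example hdfs:) are skipped.
# The body runs in a subshell so the IFS and globbing changes stay local.
missing_jars() (
    IFS=,
    set -f                      # do not expand globs while splitting on commas
    for entry in $1; do
        case $entry in
            local:*) path=${entry#local:} ;;
            file:*)  path=${entry#file:} ;;
            *:*)     continue ;;
            *)       path=$entry ;;
        esac
        [ -f "$path" ] || printf '%s\n' "$entry"
    done
)

# Possible usage:
# missing_jars "local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar"
```

Empty output means every checked entry was found. With local: the jars must also exist on each worker, which this check does not verify.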

Today when I was testing --jars, we had to explicitly provide a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData --jars=local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar /usr/local/spark/jars/thold-0.0.1-1.jar