spark-submit: --jars does not work

2020-07-23 04:18发布

问题:

I am building metrics system for Spark Streaming job, in the system, the metrics are collected in each executor, so a metrics source (a class used to collect metrics) needs to be initialized in each executor.

The metrics source is packaged in a jar, when submitting a job, the jar is sent from local to each executor using the parameter '--jars', however, the executor starts to initialize the metrics source class before the jar arrives, as a result, it throws class not found exception.

It seems that if the executor could wait until all resources are ready, the issue will be resolved, but I really do not know how to do it.

Is there anyone facing the same issue?

PS: I tried using HDFS (copy the jar to HDFS, then submit the job and let the executor load class from a path in HDFS), but it fails. I checked the source code, it seems that the class loader can only resolve local path.

Here is the log, you can see that the jar is added to classpath at 2016-01-15 18:08:07, but the initialization starts at 2016-01-15 18:07:26

INFO 2016-01-15 18:08:07 org.apache.spark.executor.Executor: Adding file:/var/lib/spark/worker/worker-0/app-20160115180722-0041/0/./datainsights-metrics-source-assembly-1.0.jar to class loader

ERROR 2016-01-15 18:07:26 Logging.scala:96 - org.apache.spark.metrics.MetricsSystem: Source class org.apache.spark.metrics.PerfCounterSource cannot be instantiated

Here is the command I use:

spark-submit --verbose \
 --jars /tmp/datainsights-metrics-source-assembly-1.0.jar \ 
 --conf "spark.metrics.conf=metrics.properties" \
 --class org.microsoft.ofe.datainsights.StartServiceSignalPipeline \
 ./target/datainsights-1.0-jar-with-dependencies.jar

回答1:

I could think of couple of Options: -

  1. Create a Fat Jar File, which includes main classes and dependencies.
  2. If dependencies are used only by executors and not by the Driver, then you can explicitly add jar files using SparkConf.setJars(....) or in case it is used by driver too, then you can also use command line option --driver-class-path for configuring Driver classpath.
  3. Try configure it in Spark-default.conf using following parameters: -

    spark.executor.extraClassPath=<classapth>
    spark.executor.extraClassPath=<classapth>
    

No matter what you do, I would suggest to fix the network latency, otherwise it will hurt the performance of Spark jobs.