I am building a metrics system for a Spark Streaming job. In this system, metrics are collected on each executor, so a metrics source (a class used to collect metrics) needs to be initialized on each executor.
The metrics source is packaged in a jar. When submitting the job, the jar is sent from the local machine to each executor with the '--jars' parameter. However, the executor starts to initialize the metrics source class before the jar arrives, and as a result it throws a ClassNotFoundException.
It seems the issue would be resolved if the executor could wait until all resources are ready, but I do not know how to do that.
Has anyone faced the same issue?
PS: I tried using HDFS (copying the jar to HDFS, then submitting the job so the executor loads the class from an HDFS path), but that failed. I checked the source code, and it seems the class loader can only resolve local paths.
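One workaround I am considering (an untested sketch): pre-copy the jar to the same local path on every worker node, then put that path on the executor classpath at JVM startup with the standard spark.executor.extraClassPath setting, so the class is available before the metrics system initializes. The path /opt/spark-extra and the worker hostnames below are hypothetical:

```shell
# Untested sketch: first copy the jar to an identical local path on every
# worker node (hostnames and /opt/spark-extra are placeholders):
for host in worker1 worker2; do
  scp /tmp/datainsights-metrics-source-assembly-1.0.jar "$host":/opt/spark-extra/
done

# Then put that path on the executor classpath at JVM startup,
# instead of shipping the jar with --jars after the JVM is already up:
spark-submit --verbose \
  --conf "spark.executor.extraClassPath=/opt/spark-extra/datainsights-metrics-source-assembly-1.0.jar" \
  --conf "spark.metrics.conf=metrics.properties" \
  --class org.microsoft.ofe.datainsights.StartServiceSignalPipeline \
  ./target/datainsights-1.0-jar-with-dependencies.jar
```

I have not verified this end to end, but since extraClassPath entries are on the classpath from JVM start, it should avoid the race with --jars distribution.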
Here is the log; you can see that the jar is added to the classpath at 2016-01-15 18:08:07, but the initialization starts at 2016-01-15 18:07:26:
INFO 2016-01-15 18:08:07 org.apache.spark.executor.Executor: Adding file:/var/lib/spark/worker/worker-0/app-20160115180722-0041/0/./datainsights-metrics-source-assembly-1.0.jar to class loader
ERROR 2016-01-15 18:07:26 Logging.scala:96 - org.apache.spark.metrics.MetricsSystem: Source class org.apache.spark.metrics.PerfCounterSource cannot be instantiated
Here is the command I use:
spark-submit --verbose \
--jars /tmp/datainsights-metrics-source-assembly-1.0.jar \
--conf "spark.metrics.conf=metrics.properties" \
--class org.microsoft.ofe.datainsights.StartServiceSignalPipeline \
./target/datainsights-1.0-jar-with-dependencies.jar
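For reference, the source is registered in metrics.properties roughly as below (assuming Spark's standard `[instance].source.[name].class` syntax; the source name "perfCounter" is my own placeholder, and the class name is the one from the log). This registration is presumably why the executor's MetricsSystem tries to instantiate the class at startup, before the --jars file has been fetched:

```
# Register the custom source on executors
# ("perfCounter" is an arbitrary name I chose for illustration):
executor.source.perfCounter.class=org.apache.spark.metrics.PerfCounterSource
```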