How to integrate Ganglia for Spark 2.1 Job metrics

2019-09-14 21:11发布

问题:

I am trying to integrate Spark 2.1 job's metrics to Ganglia.

My spark-default.conf looks like

*.sink.ganglia.class org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name Name
*.sink.ganglia.host $MASTERIP
*.sink.ganglia.port $PORT

*.sink.ganglia.mode unicast
*.sink.ganglia.period 10
*.sink.ganglia.unit seconds

When i submit my job i can see the warn

Warning: Ignoring non-spark config property: *.sink.ganglia.host=host
Warning: Ignoring non-spark config property: *.sink.ganglia.name=Name
Warning: Ignoring non-spark config property: *.sink.ganglia.mode=unicast
Warning: Ignoring non-spark config property: *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
Warning: Ignoring non-spark config property: *.sink.ganglia.period=10
Warning: Ignoring non-spark config property: *.sink.ganglia.port=8649
Warning: Ignoring non-spark config property: *.sink.ganglia.unit=seconds

My environment details are

Hadoop : Amazon 2.7.3 - emr-5.7.0  
Spark  : Spark 2.1.1, 
Ganglia: 3.7.2

If you have any inputs or any other alternative of Ganglia please reply.

回答1:

according to the spark docs

The metrics system is configured via a configuration file that Spark expects to be present at $SPARK_HOME/conf/metrics.properties. A custom file location can be specified via the spark.metrics.conf configuration property.

so instead of having these confs in spark-default.conf, move them to $SPARK_HOME/conf/metrics.properties



回答2:

For EMR specifically, you'll need to put these settings in /etc/spark/conf/metrics.properties on the master node.

Spark on EMR does include the Ganglia library:

$ ls -l /usr/lib/spark/external/lib/spark-ganglia-lgpl_*
-rw-r--r-- 1 root root 28376 Mar 22 00:43 /usr/lib/spark/external/lib/spark-ganglia-lgpl_2.11-2.3.0.jar

In addition, your example is missing the equals sign (=) between the config names and values - unsure if that's an issue. Below is an example config that worked successfully for me.

*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name=AMZN-EMR
*.sink.ganglia.host=$MASTERIP
*.sink.ganglia.port=8649

*.sink.ganglia.mode=unicast
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds


回答3:

From this page: https://spark.apache.org/docs/latest/monitoring.html

Spark also supports a Ganglia sink which is not included in the default build due to licensing restrictions:

GangliaSink: Sends metrics to a Ganglia node or multicast group.
**To install the GangliaSink you’ll need to perform a custom build of Spark**. Note that by embedding this library you will include LGPL-licensed code in your Spark package. For sbt users, set the SPARK_GANGLIA_LGPL environment variable before building. For Maven users, enable the -Pspark-ganglia-lgpl profile. In addition to modifying the cluster’s Spark build user