I submit Spark jobs with spark-submit on an Amazon EMR cluster, and I'd like all Spark logging to be sent to Redis/Logstash. What is the proper way to configure Spark under EMR to do this?
Option 1, keep log4j: add a bootstrap action that modifies /home/hadoop/spark/conf/log4j.properties to add an appender? However, that file already contains a lot of configuration and is a symlink to a Hadoop conf file, so I don't want to fiddle with it too much, especially since it already defines some rootLoggers. Which appender would do best: ryantenney/log4j-redis-appender together with logstash/log4j-jsonevent-layout, or pavlobaron/log4j2redis?
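To make option 1 concrete, here is roughly what I'm picturing the bootstrap action appending to log4j.properties. The appender class and property names are assumptions based on the two projects (com.ryantenney.log4j.RedisAppender, net.logstash.log4j.JSONEventLayoutV1) and would need to be checked against the versions actually bundled:

```properties
# Sketch only -- what a bootstrap action might append to
# /home/hadoop/spark/conf/log4j.properties. Class and property names are
# assumptions taken from ryantenney/log4j-redis-appender and
# logstash/log4j-jsonevent-layout; verify them against the jars you bundle.
# Host and key values are placeholders.
log4j.appender.redis=com.ryantenney.log4j.RedisAppender
log4j.appender.redis.host=my-redis-host
log4j.appender.redis.port=6379
log4j.appender.redis.key=logstash
log4j.appender.redis.layout=net.logstash.log4j.JSONEventLayoutV1
```

The catch is that "redis" would still have to be added to the existing log4j.rootLogger line, which is exactly the part of the EMR-managed file I'd rather not rewrite blindly.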
Option 2, migrate to slf4j+logback: exclude slf4j-log4j12 from spark-core, add log4j-over-slf4j, and use a logback.xml with a com.cwbase.logback.RedisAppender? This looks like it will be problematic dependency-wise. Also, will it hide the log4j.rootLoggers already defined in log4j.properties?
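For option 2, the logback.xml I have in mind would look roughly like this; the appender's element names (host, port, key) are assumptions about com.cwbase.logback.RedisAppender's configuration, and the values are placeholders:

```xml
<!-- Sketch of a logback.xml routing everything to Redis. The element names
     accepted by com.cwbase.logback.RedisAppender are assumptions; check them
     against the logback-redis-appender documentation. -->
<configuration>
  <appender name="REDIS" class="com.cwbase.logback.RedisAppender">
    <!-- placeholder Redis endpoint and the list key Logstash reads from -->
    <host>my-redis-host</host>
    <port>6379</port>
    <key>logstash</key>
  </appender>
  <root level="INFO">
    <appender-ref ref="REDIS" />
  </root>
</configuration>
```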
Anything else I missed?
What are your thoughts on this?
Update
Looks like I can't get the second option to work. Running tests is fine, but using spark-submit (with --conf spark.driver.userClassPathFirst=true) always ends up with the dreaded "Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class path, preempting StackOverflowError."
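If I understand the error correctly, slf4j-log4j12 routes SLF4J calls into log4j while log4j-over-slf4j routes log4j calls back into SLF4J, so SLF4J refuses to start rather than loop forever. Excluding slf4j-log4j12 from my own assembly apparently isn't enough, because the EMR/Spark distribution still puts its own copy on the classpath. A quick way to confirm that on a node (the search root is a guess about the EMR layout):

```sh
# Sketch: look for both bridge jars on the node. /home/hadoop as the search
# root is an assumption about where this EMR AMI keeps Hadoop/Spark jars.
find /home/hadoop -name 'slf4j-log4j12*.jar' 2>/dev/null
find /home/hadoop -name 'log4j-over-slf4j*.jar' 2>/dev/null
```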
I would set up an extra daemon for that on the cluster.