I'm attempting to access Accumulo 1.6 from an Apache Spark job (written in Java) by using an AccumuloInputFormat with newAPIHadoopRDD. In order to do this, I have to tell the AccumuloInputFormat where to locate ZooKeeper by calling the setZooKeeperInstance method. This method takes a ClientConfiguration object which specifies various relevant properties.
I'm creating my ClientConfiguration object by calling the static loadDefault method. This method is supposed to look in various places for a client.conf file to load its defaults from. One of the places it's supposed to look is $ACCUMULO_CONF_DIR/client.conf.
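For reference, a minimal client.conf is just a Java-properties-style file; the instance name and ZooKeeper quorum below are placeholders standing in for my actual values:

# /etc/accumulo/conf/client.conf
instance.name=my-instance
instance.zookeeper.host=zk1:2181,zk2:2181,zk3:2181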
Therefore, I am attempting to set the ACCUMULO_CONF_DIR environment variable in such a way that it will be visible when Spark runs the job (for reference, I'm attempting to run in the yarn-cluster deployment mode). I have not yet found a way to do that successfully.
So far, I've tried the following (concrete snippets of each attempt appear after this list):
- Calling setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf") on the SparkConf
- Exporting ACCUMULO_CONF_DIR in spark-env.sh
- Setting spark.executorEnv.ACCUMULO_CONF_DIR in spark-defaults.conf
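Concretely, the three attempts looked roughly like this (assuming the standard CDH5 config file locations; /etc/accumulo/conf is the actual path from my setup):

// Attempt 1, in the driver code:
sparkConf.setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf");

# Attempt 2, in spark-env.sh:
export ACCUMULO_CONF_DIR=/etc/accumulo/conf

# Attempt 3, in spark-defaults.conf:
spark.executorEnv.ACCUMULO_CONF_DIR /etc/accumulo/conf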
None of them have worked. When I print the environment before calling setZooKeeperInstance, ACCUMULO_CONF_DIR does not appear.
If it's relevant, I'm using the CDH5 versions of everything.
Here's an example of what I'm trying to do (imports and exception handling left out for brevity):
public class MySparkJob
{
    public static void main(String[] args)
    {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("MySparkJob");
        sparkConf.setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Job accumuloJob = Job.getInstance(sc.hadoopConfiguration());
        // Print the environment; ACCUMULO_CONF_DIR never appears in the output.
        for (Map.Entry<String, String> entry : System.getenv().entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
        ClientConfiguration accumuloConfiguration = ClientConfiguration.loadDefault();
        AccumuloInputFormat.setZooKeeperInstance(accumuloJob, accumuloConfiguration);
        // Other calls to AccumuloInputFormat static functions to configure it properly.
        JavaPairRDD<Key, Value> accumuloRDD =
            sc.newAPIHadoopRDD(accumuloJob.getConfiguration(),
                               AccumuloInputFormat.class,
                               Key.class,
                               Value.class);
    }
}
So I discovered the answer to this while writing the question (sorry, reputation seekers). The problem is that CDH5 uses Spark 1.0.0, and that I was running the job via YARN. Apparently, YARN mode does not pay any attention to the executor environment and instead uses the environment variable SPARK_YARN_USER_ENV to control its environment. So ensuring that SPARK_YARN_USER_ENV contains ACCUMULO_CONF_DIR=/etc/accumulo/conf works, and makes ACCUMULO_CONF_DIR visible in the environment at the indicated point in the question's source example.

This difference in how standalone mode and YARN mode work resulted in SPARK-1680, which is reported as fixed in Spark 1.1.0.
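For anyone else who hits this, here is a minimal sketch of the workaround. I'm exporting the variable in the shell before invoking spark-submit (it could equally go in spark-env.sh); the jar and class names are placeholders:

# SPARK_YARN_USER_ENV takes a comma-separated list of KEY=VALUE pairs.
export SPARK_YARN_USER_ENV="ACCUMULO_CONF_DIR=/etc/accumulo/conf"
spark-submit --class MySparkJob --master yarn-cluster my-spark-job.jar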