Using HDP 2.5.3 and I've been trying to debug some YARN container classpath issues.
Since HDP includes both Spark 1.6 and 2.0.0, there have been some conflicting versions
Users I support are successfully able to use Spark2 with Hive queries in YARN client
mode, but not from cluster
mode they get errors about tables not found, or something like that because the Metastore connection isn't established.
I am guessing that setting either --driver-class-path /etc/spark2/conf:/etc/hive/conf
or passing --files /etc/spark2/conf/hive-site.xml
after spark-submit
would work, but why isn't hive-site.xml
loaded already from the conf
folder?
Accoringing to Hortonworks docs, says hive-site
should be placed in $SPARK_HOME/conf
, and it is...
I see hdfs-site.xml
and core-site.xml
, and other files that are part of HADOOP_CONF_DIR
, for example, and this is the from the YARN UI container info.
2232355 4 drwx------ 2 yarn hadoop 4096 Aug 2 21:59 ./__spark_conf__
2232379 4 -r-x------ 1 yarn hadoop 2358 Aug 2 21:59 ./__spark_conf__/topology_script.py
2232381 8 -r-x------ 1 yarn hadoop 4676 Aug 2 21:59 ./__spark_conf__/yarn-env.sh
2232392 4 -r-x------ 1 yarn hadoop 569 Aug 2 21:59 ./__spark_conf__/topology_mappings.data
2232398 4 -r-x------ 1 yarn hadoop 945 Aug 2 21:59 ./__spark_conf__/taskcontroller.cfg
2232356 4 -r-x------ 1 yarn hadoop 620 Aug 2 21:59 ./__spark_conf__/log4j.properties
2232382 12 -r-x------ 1 yarn hadoop 8960 Aug 2 21:59 ./__spark_conf__/hdfs-site.xml
2232371 4 -r-x------ 1 yarn hadoop 2090 Aug 2 21:59 ./__spark_conf__/hadoop-metrics2.properties
2232387 4 -r-x------ 1 yarn hadoop 662 Aug 2 21:59 ./__spark_conf__/mapred-env.sh
2232390 4 -r-x------ 1 yarn hadoop 1308 Aug 2 21:59 ./__spark_conf__/hadoop-policy.xml
2232399 4 -r-x------ 1 yarn hadoop 1480 Aug 2 21:59 ./__spark_conf__/__spark_conf__.properties
2232389 4 -r-x------ 1 yarn hadoop 1602 Aug 2 21:59 ./__spark_conf__/health_check
2232385 4 -r-x------ 1 yarn hadoop 913 Aug 2 21:59 ./__spark_conf__/rack_topology.data
2232377 4 -r-x------ 1 yarn hadoop 1484 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-audit.xml
2232383 4 -r-x------ 1 yarn hadoop 1020 Aug 2 21:59 ./__spark_conf__/commons-logging.properties
2232357 8 -r-x------ 1 yarn hadoop 5721 Aug 2 21:59 ./__spark_conf__/hadoop-env.sh
2232391 4 -r-x------ 1 yarn hadoop 281 Aug 2 21:59 ./__spark_conf__/slaves
2232373 8 -r-x------ 1 yarn hadoop 6407 Aug 2 21:59 ./__spark_conf__/core-site.xml
2232393 4 -r-x------ 1 yarn hadoop 812 Aug 2 21:59 ./__spark_conf__/rack-topology.sh
2232394 4 -r-x------ 1 yarn hadoop 1044 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-security.xml
2232395 8 -r-x------ 1 yarn hadoop 4956 Aug 2 21:59 ./__spark_conf__/metrics.properties
2232386 8 -r-x------ 1 yarn hadoop 4221 Aug 2 21:59 ./__spark_conf__/task-log4j.properties
2232380 4 -r-x------ 1 yarn hadoop 64 Aug 2 21:59 ./__spark_conf__/ranger-security.xml
2232372 20 -r-x------ 1 yarn hadoop 19975 Aug 2 21:59 ./__spark_conf__/yarn-site.xml
2232397 4 -r-x------ 1 yarn hadoop 1006 Aug 2 21:59 ./__spark_conf__/ranger-policymgr-ssl.xml
2232374 4 -r-x------ 1 yarn hadoop 29 Aug 2 21:59 ./__spark_conf__/yarn.exclude
2232384 4 -r-x------ 1 yarn hadoop 1606 Aug 2 21:59 ./__spark_conf__/container-executor.cfg
2232396 4 -r-x------ 1 yarn hadoop 1000 Aug 2 21:59 ./__spark_conf__/ssl-server.xml
2232375 4 -r-x------ 1 yarn hadoop 1 Aug 2 21:59 ./__spark_conf__/dfs.exclude
2232359 8 -r-x------ 1 yarn hadoop 7660 Aug 2 21:59 ./__spark_conf__/mapred-site.xml
2232378 16 -r-x------ 1 yarn hadoop 14474 Aug 2 21:59 ./__spark_conf__/capacity-scheduler.xml
2232376 4 -r-x------ 1 yarn hadoop 884 Aug 2 21:59 ./__spark_conf__/ssl-client.xml
As you might see, hive-site
is not there, even though I definitely have conf/hive-site.xml
for spark-submit to take
[spark@asthad006 conf]$ pwd && ls -l
/usr/hdp/2.5.3.0-37/spark2/conf
total 32
-rw-r--r-- 1 spark spark 742 Mar 6 15:20 hive-site.xml
-rw-r--r-- 1 spark spark 620 Mar 6 15:20 log4j.properties
-rw-r--r-- 1 spark spark 4956 Mar 6 15:20 metrics.properties
-rw-r--r-- 1 spark spark 824 Aug 2 22:24 spark-defaults.conf
-rw-r--r-- 1 spark spark 1820 Aug 2 22:24 spark-env.sh
-rwxr-xr-x 1 spark spark 244 Mar 6 15:20 spark-thrift-fairscheduler.xml
-rw-r--r-- 1 hive hadoop 918 Aug 2 22:24 spark-thrift-sparkconf.conf
So, I don't think I am supposed to place hive-site in HADOOP_CONF_DIR
as HIVE_CONF_DIR
is separated, but my question is that how do we get Spark2 to pick up the hive-site.xml
without needing to manually pass it as a parameter at runtime?
EDIT Naturally, since I'm on HDP I am using Ambari. The previous cluster admin has installed Spark2 clients on all of the machines, so all of the YARN NodeManagers that could be potential Spark drivers should have the same config files
Found an issue with this
You create a
org.apache.spark.sql.SQLContext
before creating hive context thehive-site.xml
is not picked properly when you create hive context.Solution : Create the hive context before creating another SQL context.
The way I understand it, in
local
oryarn-client
modes...>
hive-site.xml
is searched in the CLASSPATH by the Hive/Hadoop client libs (including indriver.extraClassPath
because the Driver runs inside the Launcher and the merged CLASSPATH is already built at this point)> that's
$SPARK_CONF_DIR/hive-site.xml
>
hive-site.xml
is searched in the CLASSPATH by the Hive/Hadoop client libs (and the Kerberos token is used, if any)So you can have one
hive-site.xml
stating that Spark should use an embedded, in-memory Derby instance to use as a sandbox (in-memory implying "stop leaving all these temp files behind you") while anotherhive-site.xml
gives the actual Hive Metastore URI. And all is well.Now, in
yarn-cluster
mode, all that mechanism pretty much explodes in a nasty, undocumented mess.The Launcher needs its own CLASSPATH settings to create the Kerberos tokens, otherwise it fails silently. Better go to the source code to find out which undocumented Env variable you shoud use.
It may also need an override in some properties because the hard-coded defaults suddenly are not the defaults any more (silently).
The Driver cannot tap the original
$SPARK_CONF_DIR
, it has to rely on what the Launcher has made available for upload. Does that include a copy of$SPARK_CONF_DIR/hive-site.xml
? Looks like it's not the case.So you are probably using a Derby thing as a stub.
And the Driver has to to do with whatever YARN has forced on the container CLASSPATH, in whatever order.
Besides, the
driver.extraClassPath
additions do NOT take precedence by default; for that you have to forcespark.yarn.user.classpath.first=true
(which is translated to the standard Hadoop property whose exact name I can't remember right now, especially since there are multiple props with similar names that may be deprecated and/or not working in Hadoop 2.x)Think that's bad? Try out connecting to a Kerberized HBase in
yarn-cluster
mode. The connection is done in the Executors, that's another layer of nastyness. But I disgress.Bottom line: start your diagnostic again.
A. Are you really, really sure that the mysterious "Metastore connection errors" are caused by missing properties, and specifically the Metastore URI?
B. By the way, are your users explicitly using a
HiveContext
???C. What is exactly the CLASSPATH that YARN presents to the Driver JVM, and what is exactly the CLASSPATH that the Driver presents to the Hadoop libs when opening the Metastore connection?
D. If the CLASSPATH built by YARN is messed up for some reason, what would be the minimal fix -- change in precedence rules? addition? both?
You can use spark property -
spark.yarn.dist.files
and specify path to hive-site.xml there.In the
cluster
mode
configuration is read from theconf
directory of the machine, which runs thedriver
container, not the one use forspark-submit
.