I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before the HiveContext is created.

Just setting the latter, as mentioned in this answer, does not work on Spark 1.6.1:
import org.apache.spark.sql.hive.HiveContext

val sqlc = new HiveContext(sparkContext)
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)
The table metadata goes in the right place but the written files go to /user/hive/warehouse.
If a dataframe is saved without an explicit path, e.g.,
df.write.saveAsTable("tbl")
the location to write files to is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database. That location seems to be cached during HiveContext construction, so setting fs.defaultFS after the HiveContext is constructed has no effect.
As an aside, but one very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata but leaves the table files behind, which wreaks havoc with test expectations. This is a known problem (see here and here), and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.
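One stopgap in tests is to remove the leftover table directory by hand after the drop; a minimal sketch, reusing the hiveWarehouseDir value from above and assuming the table name tbl from the earlier example:

import org.apache.hadoop.fs.{FileSystem, Path}

// Stopgap for tests: after DROP TABLE, remove the leftover table directory
// by hand so subsequent runs do not see stale files.
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
fs.delete(new Path(s"$hiveWarehouseDir/tbl"), true) // recursive delete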
In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?
The spark-testing-base library has a TestHiveContext configured as part of the setup for DataFrameSuiteBaseLike. Even if you're unable to use spark-testing-base directly for some reason, you can see how they make the configuration work.

In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession's builder before creating the SparkSession. It should propagate correctly.
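A minimal sketch of that approach (testWarehouseDir is a placeholder for a temporary directory created by the test harness):

import org.apache.spark.sql.SparkSession

// Spark 2.x: point the warehouse at a test directory before the session
// (and therefore the underlying Hive client) is created.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("hive-integration-test")
  .config("spark.sql.warehouse.dir", testWarehouseDir)
  .enableHiveSupport()
  .getOrCreate()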
For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
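A rough sketch of that idea, assuming the test can write into a directory that is already on its classpath (e.g. target/test-classes) before the HiveContext is constructed; testClasspathDir and hiveWarehouseDir are placeholders for the test's own values:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write a minimal hive-site.xml so Hive resolves the warehouse location
// from the test's own configuration rather than /user/hive/warehouse.
val hiveSite =
  s"""<?xml version="1.0"?>
     |<configuration>
     |  <property>
     |    <name>hive.metastore.warehouse.dir</name>
     |    <value>$hiveWarehouseDir</value>
     |  </property>
     |</configuration>
     |""".stripMargin
Files.write(
  Paths.get(testClasspathDir, "hive-site.xml"),
  hiveSite.getBytes(StandardCharsets.UTF_8))

Whether Hive actually picks this up depends on the file being visible on the classpath before the first HiveContext is constructed.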