Hive configuration for Spark integration tests

Posted 2019-05-29 05:10

Question:

I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before HiveContext is created.

Just setting the latter, as mentioned in this answer, does not work on Spark 1.6.1.

val sqlc = new HiveContext(sparkContext)
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)

The table metadata goes in the right place but the written files go to /user/hive/warehouse.

If a dataframe is saved without an explicit path, e.g.,

df.write.saveAsTable("tbl")

the location to write the files to is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database. That location appears to be cached during HiveContext construction, so setting fs.defaultFS after the HiveContext has been created has no effect.

As an aside, but very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata and leaves the table files behind, which wreaks havoc with test expectations. This is a known problem (see here and here), and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.
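
To make that invariant concrete, a rough sketch (the local-filesystem root is an assumed value, not taken from a real setup):

// Root both settings in the same test-managed location so that table metadata
// and table files end up together; "file:///tmp/spark-it" is only illustrative.
val fsDefault = "file:///tmp/spark-it"                      // fs.defaultFS
val hiveWarehouseDir = s"$fsDefault/user/hive/warehouse"    // hive.metastore.warehouse.dir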

In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?

Answer 1:

In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession's builder, before creating a SparkSession. It should propagate correctly.
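
For example, a minimal sketch, assuming the test harness provides a temporary directory path in testWarehouseDir:

import org.apache.spark.sql.SparkSession

// The warehouse location must be supplied on the builder before getOrCreate();
// setting it on an already-created session does not move the warehouse.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hive-integration-test")
  .config("spark.sql.warehouse.dir", testWarehouseDir)
  .enableHiveSupport()
  .getOrCreate()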

For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
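
A hedged sketch of that approach, assuming the test build writes resources to target/test-classes (so the file ends up on the test classpath) and that this runs before the HiveContext is constructed:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Generate a hive-site.xml pointing the warehouse at a temporary directory;
// Hive reads this file from the classpath when the HiveContext is created.
val hiveWarehouseDir = Files.createTempDirectory("hive-warehouse").toUri.toString
val hiveSiteXml =
  s"""<?xml version="1.0"?>
     |<configuration>
     |  <property>
     |    <name>hive.metastore.warehouse.dir</name>
     |    <value>$hiveWarehouseDir</value>
     |  </property>
     |</configuration>
     |""".stripMargin
Files.write(Paths.get("target/test-classes/hive-site.xml"),
  hiveSiteXml.getBytes(StandardCharsets.UTF_8))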



Answer 2:

The spark-testing-base library has a TestHiveContext configured as part of the setup for DataFrameSuiteBaseLike. Even if you're unable to use spark-testing-base directly for some reason, you can see how they make the configuration work.
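
For reference, a rough sketch of what such a test can look like (the suite name, table name, and assertion are illustrative, and the exact trait members may vary across spark-testing-base versions):

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.FunSuite

class WarehouseSpec extends FunSuite with DataFrameSuiteBase {
  test("saveAsTable writes under the test-managed warehouse") {
    // The trait sets up the underlying SQL/Hive context, so the test does not
    // need to configure the warehouse location itself.
    val sqlCtx = sqlContext
    import sqlCtx.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    df.write.saveAsTable("tbl")
    assert(sqlCtx.table("tbl").count() == 2)
  }
}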