I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before the HiveContext is created.

Just setting the latter, as mentioned in this answer, does not work on Spark 1.6.1:
import org.apache.spark.sql.hive.HiveContext

val sqlc = new HiveContext(sparkContext)
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)
The table metadata goes in the right place but the written files go to /user/hive/warehouse.
If a dataframe is saved without an explicit path, e.g.,
df.write.saveAsTable("tbl")
the location to write files to is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database. That location seems to be cached during HiveContext construction, so setting fs.defaultFS after the HiveContext is constructed has no effect.
As an aside, but one very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata but leaves the table files behind, which wreaks havoc with test expectations. This is a known problem (see here and here), and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.
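One stopgap in tests is to remove the leftover table directory by hand after the drop; a minimal sketch, reusing the hiveWarehouseDir value from above and assuming the table name tbl from the earlier example:

import org.apache.hadoop.fs.{FileSystem, Path}

// Stopgap for tests: after DROP TABLE, remove the leftover table directory
// by hand so subsequent runs do not see stale files.
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
fs.delete(new Path(s"$hiveWarehouseDir/tbl"), true) // recursive delete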
In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?
The spark-testing-base library has a TestHiveContext configured as part of the setup for DataFrameSuiteBaseLike. Even if you're unable to use spark-testing-base directly for some reason, you can see how they make the configuration work.

In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession's builder before creating the SparkSession. It should propagate correctly.
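A minimal sketch of that approach (testWarehouseDir is a placeholder for a temporary directory created by the test harness):

import org.apache.spark.sql.SparkSession

// Spark 2.x: point the warehouse at a test directory before the session
// (and therefore the underlying Hive client) is created.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("hive-integration-test")
  .config("spark.sql.warehouse.dir", testWarehouseDir)
  .enableHiveSupport()
  .getOrCreate()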
For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
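A rough sketch of that idea, assuming the test can write into a directory that is already on its classpath (e.g. target/test-classes) before the HiveContext is constructed; testClasspathDir and hiveWarehouseDir are placeholders for the test's own values:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write a minimal hive-site.xml so Hive resolves the warehouse location
// from the test's own configuration rather than /user/hive/warehouse.
val hiveSite =
  s"""<?xml version="1.0"?>
     |<configuration>
     |  <property>
     |    <name>hive.metastore.warehouse.dir</name>
     |    <value>$hiveWarehouseDir</value>
     |  </property>
     |</configuration>
     |""".stripMargin
Files.write(
  Paths.get(testClasspathDir, "hive-site.xml"),
  hiveSite.getBytes(StandardCharsets.UTF_8))

Whether Hive actually picks this up depends on the file being visible on the classpath before the first HiveContext is constructed.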