Filter Partition Before Reading Hive Table (Spark)

Published 2019-07-29 07:53

Currently I'm trying to filter a Hive table by the latest date_processed.

The table is partitioned by: System, date_processed, Region.

The only way I've managed to filter it is with a join query:

query = "select * from contracts_table a join (select max(date_processed) as maximum from contracts_table) b on a.date_processed = b.maximum"

This approach is really time-consuming, as I have to repeat the same procedure for 25 tables.

Does anyone know a way to directly read the latest loaded partition of a table in Spark < 1.6?

This is the method I'm using to read:

public static DataFrame loadAndFilter (String query)
{
        // Run the query through the shared HiveContext and return the result.
        return SparkContextSingleton.getHiveContext().sql(query);
}
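For what it's worth, once the latest date is known, the filter can be pushed into the read itself so only that partition is scanned. A minimal Scala sketch of the idea, reusing the question's SparkContextSingleton helper (the date literal is a made-up placeholder):

// Filtering on the partition column lets Spark prune to one partition
// instead of scanning the whole table; '2019-07-28' is a placeholder.
val latest = SparkContextSingleton.getHiveContext()
  .table("contracts_table")
  .filter("date_processed = '2019-07-28'")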

Many thanks!

1 Answer

一夜七次 · 2019-07-29 08:37

A DataFrame with all of the table's partitions can be obtained with:

val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")

The partition values can then be parsed to extract the maximum date_processed, and that value used to read only the latest partition.
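A minimal sketch of that parsing step, assuming the table name contracts_table from the question and that show partitions returns one string per partition in the usual system=.../date_processed=.../region=... form (the example value in the comment is made up):

// Collect the partition specs and pull out each date_processed value.
val latestDate = partitionsDF
  .collect()
  .map(_.getString(0))          // e.g. "system=A/date_processed=2019-07-28/region=EU"
  .flatMap(_.split("/").find(_.startsWith("date_processed=")))
  .map(_.stripPrefix("date_processed="))
  .max                          // lexicographic max matches date order for yyyy-MM-dd

// Read only the latest partition; filtering on a partition column means
// Spark reads just that partition's files, with no full-table join.
val latestDF = hiveContext.sql(
  s"select * from contracts_table where date_processed = '$latestDate'")

Wrapping these lines in a function that takes the table name makes it easy to repeat for all 25 tables.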
