How to tune hive to query metadata?

Posted 2019-06-26 05:09

When I run the Hive query below on a table with a partitioned column, I want to make sure Hive does not do a full table scan and instead works out the result from the metadata alone. Is there any way to enable this?

Select max(partitioned_col) from hive_table ;

Right now, when I run this query, it launches MapReduce tasks, and I am sure it is scanning the data, even though it could work out the value from the metadata alone.

Tags: performance hadoop hive hdfs tez

1 answer

劳资没心，怎么记你

#2 · 2019-06-26 05:42

Compute table statistics every time you change the data:

ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS FOR COLUMNS;

Enable CBO and statistics auto gathering:

set hive.cbo.enable=true;
set hive.stats.autogather=true;

Use these settings so that Hive can answer queries such as min, max and count from the stored statistics:

set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;

If nothing helps, I'd recommend this approach for finding the latest partition quickly: parse the max partition key from the table location with a shell script. The command below prints all paths under the table folder, sorts them in reverse order, takes the first (latest) one, cuts out the partition subfolder name and extracts the value after the = sign. All you need to do is initialize the TABLE_DIR variable and fill in the position of the partition subfolder in the path:

last_partition=$(hadoop fs -ls $TABLE_DIR/* | awk '{ print $8 }' | sort -r | head -n1 | cut -d / -f [number of partition subfolder in the path here] | cut -d = -f 2)
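If counting the position of the partition subfolder is inconvenient, the same parsing can be sketched by matching on the partition key name instead. This is only an illustration: the helper name latest_partition_value, the key name dt and the sample paths are assumptions, not part of the original answer. It also compares values as strings, so it suits zero-padded dates better than plain integers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads paths on stdin and prints the largest value
# found for the given partition key (string comparison, not numeric).
latest_partition_value() {
  local key="$1"
  # Keep only the "/key=value" path component, drop everything up to "=",
  # then take the last value in byte-wise sorted order.
  grep -o "/${key}=[^/]*" | cut -d = -f 2- | LC_ALL=C sort -u | tail -n 1
}

# Demo on made-up paths (assumed layout: <table dir>/dt=<value>/...).
printf '%s\n' \
  '/warehouse/hive_table/dt=2019-06-24' \
  '/warehouse/hive_table/dt=2019-06-26' \
  '/warehouse/hive_table/dt=2019-06-25' \
  | latest_partition_value dt
# prints 2019-06-26

# Against a real table (assumes TABLE_DIR is set and the key is named dt):
# last_partition=$(hadoop fs -ls "$TABLE_DIR" | awk '{ print $8 }' | latest_partition_value dt)
```

The grep step also discards the "Found N items" header that hadoop fs -ls prints, since that line contains no key=value component.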

Then pass the $last_partition variable to your script:

  hive -hiveconf last_partition="$last_partition" -f your_script.hql
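Inside the script, the value can be read back with Hive's ${hiveconf:...} variable substitution. The sketch below writes a minimal your_script.hql for illustration; the query itself is an assumption (it presumes partitioned_col holds string values, hence the quotes) and is not from the original answer.

```shell
#!/usr/bin/env bash
# Write an example your_script.hql. The heredoc delimiter is quoted so that
# the shell leaves ${hiveconf:last_partition} untouched for Hive to substitute.
cat > your_script.hql <<'EOF'
-- Assumed example query: read only the latest partition.
SELECT *
FROM hive_table
WHERE partitioned_col = '${hiveconf:last_partition}';
EOF

# Then run it as in the answer:
# hive -hiveconf last_partition="$last_partition" -f your_script.hql
```

Because the filter is on the partition column with a literal value, Hive can prune to that single partition rather than scanning the whole table.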
