Compute HIVE statistics in Apache Spark

2019-08-11 17:01发布

I'm trying to compute HIVE table statistic from Apache Spark:

`sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS')`

I also execute statement to see what was collected:

sqlCtx.sql('DESC FORMATTED t1')

I can see my stats was collected. However when I execute same staement in HIVE client (Ambari) - there are no statistics displayed. Is it available only to Spark if it's collected by Spark? Does spark store it somewhere else?

Another question.

I also computing stats for all columns in that table:

sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS c1,c2')

But when I want to see this stats in spark, it failed with unsupported sql statement exception:

sqlCtx.sql('DESC FORMATTED t1 c1')

According to docs it's valid hive queries. What is wrong with it?

Thanks for help.

2条回答
倾城 Initia
2楼-- · 2019-08-11 17:39

Apache Spark stores statistics as "Table parameters". To be able retrieve these stats, we need to connect to HIVE metastore and . execute query like following

select param_key, param_value 
from table_params tp, tbls t 
where tp.tbl_id=t.tbl_id and tbl_name = '<table_name>' 
and param_key like 'spark.sql.stat%';
查看更多
何必那么认真
3楼-- · 2019-08-11 17:58

just uppercase the name of table will be ok.

select param_key, param_value 
from TABLE_PARAMS tp, TBLS t 
where tp.tbl_id=t.tbl_id and tbl_name = '<table_name>' 
and param_key like 'spark.sql.stat%';
查看更多
登录 后发表回答