Hive query performance for high cardinality field

2020-01-27 07:37发布

问题:

I have a single but huge table in hive which will almost always be queried with the primary key column (say, employee_id). The table will be really huge, millions of rows inserted each day and I want to query fast using partitions over this field. I followed this post and I know that partitioning is only good for low cardinality fields, so how can I accomplish my goal of querying fast with employee_id column?

I understand that id column having very high cardinality should be used as bucketing but it does not help me with the query performance over single table, does it?

I think that if I could use something like hash(employee_id) as partitions, it would help me very much. Is this possible? I couldn't see such a thing in the documents about hive.

To summarize, what I want is fast query result for:

select * from employee where employee_id=XXX

assuming employee table has billions of records, with primary key column employee_id where classical partitioning by year, month, day etc does not help.

Thanks in advance,

回答1:

1) Use ORC with bloom filters:

CREATE TABLE employee (
  employee_id bigint,
  name STRING
) STORED AS ORC 
TBLPROPERTIES ("orc.bloom.filter.columns"="employee_id")
;

2) Enable PPD with vectorizing, use CBO and Tez:

SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.execution.engine=tez;
set hive.stats.fetch.column.stats=true;
set hive.map.aggr=true;
SET hive.tez.auto.reducer.parallelism=true; 

3) Tune proper parallelism on mappers and reducers:

--example for mappers:

    set tez.grouping.max-size=67108864;
    set tez.grouping.min-size=32000000;


--example settings for reducers: 

    set hive.exec.reducers.bytes.per.reducer=67108864; --decrease this to increase the number of reducers

Change these figures to achieve optimal performance.



标签: hadoop hive