I have a single but huge table in Hive that will almost always be queried by its primary key column (say, employee_id). The table is really huge, with millions of rows inserted each day, and I want to query it fast using partitions on this field. I followed this post, and I know that partitioning is only good for low-cardinality fields, so how can I accomplish my goal of fast queries on the employee_id column?
I understand that an id column with very high cardinality should be used for bucketing instead, but bucketing does not help query performance on a single table, does it?
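For reference, the bucketed layout I am referring to would look something like this (a sketch only; the column names and the bucket count of 256 are illustrative):

```sql
-- Hypothetical bucketed table: rows are hashed on employee_id
-- into a fixed number of bucket files at write time.
CREATE TABLE employee (
  employee_id BIGINT,
  name        STRING,
  department  STRING
)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
STORED AS ORC;
```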
I think that if I could use something like hash(employee_id) as the partition key, it would help me very much. Is this possible? I couldn't find such a thing in the Hive documentation.
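What I have in mind is something like the following (a sketch of the idea only; hash() and pmod() are Hive built-in functions, and the modulus of 64 is an arbitrary choice to keep the partition count low):

```sql
-- Hypothetical: partition by a derived hash column, since Hive
-- cannot partition on an expression directly.
CREATE TABLE employee (
  employee_id BIGINT,
  name        STRING
)
PARTITIONED BY (id_hash INT)
STORED AS ORC;

-- Populate via dynamic partitioning, computing the hash at insert time.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO employee PARTITION (id_hash)
SELECT employee_id, name, pmod(hash(employee_id), 64) AS id_hash
FROM employee_staging;

-- A query would then have to repeat the hash expression so the
-- partition can be pruned:
SELECT * FROM employee
WHERE id_hash = pmod(hash(XXX), 64) AND employee_id = XXX;
```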
To summarize, what I want is a fast result for:
select * from employee where employee_id=XXX
assuming the employee table has billions of records with primary key column employee_id, where classical partitioning by year, month, day, etc. does not help.
Thanks in advance,