I have a hive table which is partitioned by year, month, day and hour. I need to run a query against it to fetch the last 7 days data. This is in Hive 0.14.0.2.2.4.2-2
. My query currently looks like this :
SELECT COUNT(column_name) from table_name
where year >= year(date_sub(from_unixtime(unix_timestamp()), 7))
AND month >= month(date_sub(from_unixtime(unix_timestamp()), 7))
AND day >= day(date_sub(from_unixtime(unix_timestamp()), 7));
This takes a very long time. When I substitute the actual numbers for the above say something like :
SELECT COUNT(column_name) from table_name
where year >= 2017
AND month >= 2
AND day >= 13
it finishes in a few minutes. Is there any way to change the above script so that is actually includes just the numbers in the query instead of the functions?
I tried using set
like:
set yearLimit = year(date_sub(from_unixtime(unix_timestamp()), 7));
SELECT COUNT(column_name) from table_name
where year >= ${hiveconf:yearLimit}
AND month >= month(date_sub(from_unixtime(unix_timestamp()), 7))
AND day >= day(date_sub(from_unixtime(unix_timestamp()), 7));
but this does not solve the issue.
Solution
What went wrong with the original query?
(I've just changed the documentation a little bit :-))
Since unix_timestamp() values might change during the execution, the expression should be evaluated for each row, therefore preventing partitions elimination.
Why using
SET
did not work?set
is nothing but a text replacement mechanism.Nothing is being computed during the
set
.The only thing that happens is that variables are being assigned a text.
Before the query is being executed the variables place holders (
${hiveconf:...}
) are being replaced with the assigned text.Only then the query is being parsed and executed.
Demo