I have a transaction table table_A
that gets updated every day. Every day I insert new data into table_A
from external table_B
using the file_date
field to filter the necessary data from external table_B
to insert into table_A
. However, there's a huge performance difference if I use a hardcoded date vs. using the Hive date functions:
-- Fast version (~20 minutes)
SET date_ingest = '2016-12-07';
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;
INSERT
INTO
TABLE
table_A PARTITION (FILE_DATE) SELECT
id, eventtime
,CONCAT_WS( '-' ,substr ( eventtime ,0 ,4 ) ,SUBSTRING( eventtime ,5 ,2 ) ,SUBSTRING( eventtime ,7 ,2 ) )
FROM
table_B
WHERE
file_date = ${hiveconf:date_ingest}
;
compared to:
-- Slow version (~9 hours)
SET date_ingest = date_add(to_date(from_unixtime( unix_timestamp( ) )),-1);
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;
INSERT
INTO
TABLE
table_A PARTITION (FILE_DATE) SELECT
id, eventtime
,CONCAT_WS( '-' ,substr ( eventtime ,0 ,4 ) ,SUBSTRING( eventtime ,5 ,2 ) ,SUBSTRING( eventtime ,7 ,2 ) )
FROM
table_B
WHERE
file_date = ${hiveconf:date_ingest}
;
Has anyone experienced similar issues? You should assume that I don't have access to the Unix hive command (i.e. can't use --hiveconf options) since we're using a third party UI.
Sometimes partition pruning does not work when using functions in filter clause. If you calculate the variable in the wrapper shell script and pass it as -hiveconf variable to the Hive, it will work fine. Example:
Then use it inside Hive script as
WHERE file_date ='${hiveconf:date_ingest}'