Hive query performance is slow when using Hive dat

2020-06-23 08:54发布

问题:

I have a transaction table table_A that gets updated every day. Every day I insert new data into table_A from external table_B using the file_date field to filter the necessary data from external table_B to insert into table_A. However, there's a huge performance difference if I use a hardcoded date vs. using the Hive date functions:

-- Fast version (~20 minutes)
SET date_ingest = '2016-12-07';
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;

INSERT
    INTO
        TABLE
            table_A PARTITION (FILE_DATE) SELECT
                    id, eventtime
                    ,CONCAT_WS( '-' ,substr ( eventtime ,0 ,4 ) ,SUBSTRING( eventtime ,5 ,2 ) ,SUBSTRING( eventtime ,7 ,2 ) )
                FROM
                    table_B
                WHERE
                    file_date = ${hiveconf:date_ingest}
;

compared to:

-- Slow version (~9 hours)
SET date_ingest = date_add(to_date(from_unixtime( unix_timestamp( ) )),-1);
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;

INSERT
    INTO
        TABLE
            table_A PARTITION (FILE_DATE) SELECT
                    id, eventtime
                    ,CONCAT_WS( '-' ,substr ( eventtime ,0 ,4 ) ,SUBSTRING( eventtime ,5 ,2 ) ,SUBSTRING( eventtime ,7 ,2 ) )
                FROM
                    table_B
                WHERE
                    file_date = ${hiveconf:date_ingest}
;

Has anyone experienced similar issues? You should assume that I don't have access to the Unix hive command (i.e. can't use --hiveconf options) since we're using a third party UI.

回答1:

Sometimes partition pruning does not work when using functions in filter clause. If you calculate the variable in the wrapper shell script and pass it as -hiveconf variable to the Hive, it will work fine. Example:

#inside shell script
date_ingest=$(date -d '-1 day' +%Y-%m-%d)
hive -f your_script.hql -hiveconf date_ingest="$date_ingest" 

Then use it inside Hive script as WHERE file_date ='${hiveconf:date_ingest}'