I am developing a batch job that loads data into Hive tables from HDFS files. The flow of data is as follows:
1. Read the file received in HDFS using an external Hive table
2. INSERT OVERWRITE the final Hive table from the external Hive table, applying certain transformations
3. Move the received file to Archive
This flow works fine if there is a file in the input directory for the external table to read during step 1. If there is no file, the external table will be empty and as a result executing step 2 will empty the final table. If the external table is empty, I would like to keep the existing data in the final table (the data loaded during the previous execution).
Is there a Hive property that I can set so that the final table is overwritten only when there is data to overwrite it with?
I know that I can check whether the input file exists using an HDFS command and conditionally launch the Hive queries. But I am wondering whether I can achieve the same behavior directly in Hive, which would let me avoid this extra verification.
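Try adding a dummy partition column to your table, say LOAD_TAG, and use a dynamic partition load. With dynamic partitioning, INSERT OVERWRITE replaces only the partitions that actually appear in the query result, so if the external table is empty, no partition is touched and the existing data stays in place.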
The partition value should always be the same in your case, so that every non-empty load replaces the same single partition.
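A minimal sketch of what this could look like is given below. The table names `final_table` and `ext_table`, the columns `col1` and `col2`, and the constant `'CURRENT'` are hypothetical placeholders, not names from the question, and the final table would have to be recreated as a partitioned table for this to apply.

```sql
-- Allow dynamic partitioning with no static partition column.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical final table with a dummy partition column load_tag.
CREATE TABLE IF NOT EXISTS final_table (
  col1 STRING,
  col2 INT
)
PARTITIONED BY (load_tag STRING);

-- The partition value comes from the last column of the SELECT.
-- If ext_table returns no rows, no partition is produced, so nothing
-- in final_table is overwritten.
INSERT OVERWRITE TABLE final_table PARTITION (load_tag)
SELECT
  col1,               -- apply your transformations here
  col2,
  'CURRENT' AS load_tag
FROM ext_table;
```

Because the tag is a constant, the table only ever holds one partition, so queries against `final_table` do not need to filter on `load_tag`.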