I have a Hive table. If I have a requirement that the data will be coming into the Hive table daily. If the data which is coming in is a new record(inserts) then the record should be inserted into hive table or if the data which is coming in is already existing(updates) in hive then the record should be updated.
Can anyone explain how this is achieved in Hive.
I was checking online i found this article. http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
There are serveral ways to do this and it depends on:
I think the link you've posted describes the process pretty well, thou it's very specific about the technologies used. More general way to describe this would be:
I strongly recommend you to go through Hive doc and figure out yourself how to do each step :)
Cheers,
Karol
"the data will be coming into the Hive table daily" - is a part of Data Ingestion. You can use Sqoop Incremental Import for the same. Two ways in which it can be coded.
(1) -- append , use when you know the last value coming in or
(2) --last modified, use when you have a DATE column which can be used to track the inserts.
For updates, you can use External tables as explained in the link you shared.