I have a Hive table. If I have a requirement that the data will be coming into the Hive table daily. If the data which is coming in is a new record(inserts) then the record should be inserted into hive table or if the data which is coming in is already existing(updates) in hive then the record should be updated.
Can anyone explain how this is achieved in Hive.
I was checking online i found this article.
http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
There are serveral ways to do this and it depends on:
- What are your requirements exactly,
- What version of Hive you're using (since 0.14 Hive supports full CRUD),
- What is the format of source data (if it's some RDBMS, you could use Sqoop incremental load)
- How large is the data you have to load
I think the link you've posted describes the process pretty well, thou it's very specific about the technologies used. More general way to describe this would be:
- Create external table on the the source data,
- Append new data to destination table,
- Remove duplicates based on unique key or timestamp (fe. using GROUP BY).
I strongly recommend you to go through Hive doc and figure out yourself how to do each step :)
Cheers,
Karol
"the data will be coming into the Hive table daily" - is a part of Data Ingestion. You can use Sqoop Incremental Import for the same. Two ways in which it can be coded.
(1) -- append , use when you know the last value coming in or
(2) --last modified, use when you have a DATE column which can be used to track the inserts.
For updates, you can use External tables as explained in the link you shared.