可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have a Hive table. If I have a requirement that the data will be coming into the Hive table daily. If the data which is coming in is a new record(inserts) then the record should be inserted into hive table or if the data which is coming in is already existing(updates) in hive then the record should be updated.

Can anyone explain how this is achieved in Hive.

I was checking online i found this article. http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/

回答1:

There are serveral ways to do this and it depends on:

What are your requirements exactly,
What version of Hive you're using (since 0.14 Hive supports full CRUD),
What is the format of source data (if it's some RDBMS, you could use Sqoop incremental load)
How large is the data you have to load

I think the link you've posted describes the process pretty well, thou it's very specific about the technologies used. More general way to describe this would be:

Create external table on the the source data,
Append new data to destination table,
Remove duplicates based on unique key or timestamp (fe. using GROUP BY).

I strongly recommend you to go through Hive doc and figure out yourself how to do each step :)

Cheers,
Karol

回答2:

"the data will be coming into the Hive table daily" - is a part of Data Ingestion. You can use Sqoop Incremental Import for the same. Two ways in which it can be coded.

(1) -- append , use when you know the last value coming in or

(2) --last modified, use when you have a DATE column which can be used to track the inserts.

For updates, you can use External tables as explained in the link you shared.