Hive table data management

Posted 2019-06-03 03:52

I have a Hive table that receives data daily. If an incoming record is new (an insert), it should be inserted into the Hive table; if the incoming record already exists in Hive (an update), the existing record should be updated.

Can anyone explain how this is achieved in Hive?

I was checking online and found this article: http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/

Tags: hive

2 answers

Explosion°爆炸

#2 · 2019-06-03 04:40

There are several ways to do this, and the right one depends on:

What exactly your requirements are,
Which version of Hive you're using (since 0.14, Hive supports INSERT, UPDATE and DELETE on transactional (ACID) tables),
What the format of the source data is (if it's an RDBMS, you could use a Sqoop incremental load),
How large the data you have to load is.

I think the link you've posted describes the process pretty well, though it's very specific about the technologies used. A more general way to describe it would be:

Create an external table on the source data,
Append the new data to the destination table,
Remove duplicates based on a unique key or timestamp (e.g. using GROUP BY).
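
To make the append-then-deduplicate idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for Hive so that it runs anywhere; the table names (base_table, incremental_table) and columns (id, name, modified_date) are hypothetical, and a real Hive job would read the incoming rows from an external table rather than inserting them by hand. The query itself sticks to UNION ALL, GROUP BY and a join, which Hive also supports.

```python
import sqlite3


def reconcile(base_rows, incoming_rows):
    """Append incoming rows to the base rows, then keep the newest row per id."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: id is the unique key, modified_date the timestamp.
    conn.execute("CREATE TABLE base_table (id INTEGER, name TEXT, modified_date TEXT)")
    conn.execute("CREATE TABLE incremental_table (id INTEGER, name TEXT, modified_date TEXT)")
    conn.executemany("INSERT INTO base_table VALUES (?, ?, ?)", base_rows)
    conn.executemany("INSERT INTO incremental_table VALUES (?, ?, ?)", incoming_rows)

    # UNION ALL appends the new data to the existing data (duplicates kept).
    # GROUP BY id finds the latest timestamp per key; joining back on
    # (id, timestamp) drops the older version of every updated record.
    query = """
        SELECT t.id, t.name, t.modified_date
        FROM (
            SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table
        ) t
        JOIN (
            SELECT id, MAX(modified_date) AS max_modified
            FROM (
                SELECT * FROM base_table
                UNION ALL
                SELECT * FROM incremental_table
            ) u
            GROUP BY id
        ) latest
          ON t.id = latest.id AND t.modified_date = latest.max_modified
        ORDER BY t.id
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    base = [(1, "alice", "2019-06-01"), (2, "bob", "2019-06-01")]
    incoming = [(2, "bobby", "2019-06-02"), (3, "carol", "2019-06-02")]
    for row in reconcile(base, incoming):
        print(row)
```

Note that this only removes duplicates when the newer version of a record has a strictly later timestamp; two rows with the same id and the same modified_date would both survive, so the timestamp needs to be fine-grained enough for your load schedule.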

I strongly recommend that you go through the Hive documentation and figure out for yourself how to do each step :)

Cheers,
Karol


Root（大扎）

#3 · 2019-06-03 04:55

"the data will be coming into the Hive table daily" is a data ingestion problem. You can use a Sqoop incremental import for that. It can be set up in two ways:

(1) --incremental append: use when rows are only ever added and you know the last value that was already imported, or

(2) --incremental lastmodified: use when you have a DATE column which can be used to track when rows were inserted or changed.
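
As a rough illustration of how the two modes differ on the command line, the sketch below only builds the argument list for a sqoop import call; it does not run Sqoop. The JDBC URL, table name, target directory and column names are placeholders, and sqoop_incremental_args is a hypothetical helper, not part of Sqoop. The flags themselves (--incremental, --check-column, --last-value) are the standard Sqoop 1 incremental import options.

```python
def sqoop_incremental_args(connect, table, target_dir, mode, check_column, last_value):
    """Build the argv list for a Sqoop incremental import (hypothetical helper)."""
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    return [
        "sqoop", "import",
        "--connect", connect,            # JDBC URL of the source database
        "--table", table,                # source table to import from
        "--target-dir", target_dir,      # HDFS directory the rows land in
        "--incremental", mode,           # append or lastmodified
        "--check-column", check_column,  # column Sqoop inspects for new rows
        "--last-value", str(last_value), # only rows beyond this value are imported
    ]


if __name__ == "__main__":
    # Mode 1: rows are only added, tracked by an increasing id (placeholder values).
    print(" ".join(sqoop_incremental_args(
        "jdbc:mysql://dbhost/sales", "orders", "/data/orders_incr",
        "append", "id", 1000)))
    # Mode 2: rows may change, tracked by a date/timestamp column (placeholder values).
    print(" ".join(sqoop_incremental_args(
        "jdbc:mysql://dbhost/sales", "orders", "/data/orders_incr",
        "lastmodified", "modified_date", "2019-06-02 00:00:00")))
```

The resulting list could be passed to subprocess.run on a machine where Sqoop is installed; the value to use for --last-value on the next run is something you would need to record after each import.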

For updates, you can use external tables as explained in the link you shared.
