I have a main table in Hive that stores all my data.
About once a month I want to load an incremental update with a large amount of data (a couple of billion rows). It will contain new rows as well as updates to existing entries.
What is the best way to approach this? I know recent versions of Hive support update/insert/delete on ACID tables.
What I've been thinking is to somehow find the entries that will be updated, delete them from the main table, and then just insert the incremental update. However, after trying this, the inserts are very fast but the deletes are very slow.
The other way is to use an update statement that matches the key values between the main table and the incremental update and updates the changed fields. I haven't tried this yet, but it also sounds painfully slow, since Hive would have to update each entry one by one.
Does anyone have ideas on how to do this most efficiently and effectively? I'm pretty new to Hive and databases in general.
If you cannot update in ACID mode using `MERGE`, then it's possible to update using a `FULL OUTER JOIN`: to find all entries that will be updated, join the increment data with the old data and rewrite the result into the target table. It's possible to optimize this by restricting the overwrite and the join to only those partitions in target_data that are affected by the increment.
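A minimal sketch of the FULL OUTER JOIN approach, assuming hypothetical tables `target_data` and `increment_data` keyed on `id` with two example data columns:

```sql
-- Rewrite target_data, taking each row from the increment when it exists,
-- otherwise keeping the old row. Table and column names are examples.
INSERT OVERWRITE TABLE target_data
SELECT
    COALESCE(i.id, t.id) AS id,
    CASE WHEN i.id IS NOT NULL THEN i.col1 ELSE t.col1 END AS col1,
    CASE WHEN i.id IS NOT NULL THEN i.col2 ELSE t.col2 END AS col2
FROM target_data t
FULL OUTER JOIN increment_data i
    ON t.id = i.id;
```

New rows from the increment pass through because the old side of the join is NULL for them, while untouched old rows pass through unchanged; no row-by-row deletes or updates are needed.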
Also, if you want to replace all columns with the new data, you can apply the `UNION ALL` + `row_number()` solution instead: https://stackoverflow.com/a/44755825/2700344
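A sketch of that idea, using the same hypothetical table names: union both tables, rank the rows per key so that increment rows win, and keep only the top-ranked row for each key.

```sql
-- Keep the increment's version of a row when the key exists in both tables.
-- Table and column names are examples.
INSERT OVERWRITE TABLE target_data
SELECT id, col1, col2
FROM (
    SELECT id, col1, col2,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY priority) AS rn
    FROM (
        SELECT id, col1, col2, 1 AS priority FROM increment_data  -- new data wins
        UNION ALL
        SELECT id, col1, col2, 2 AS priority FROM target_data     -- old data
    ) all_rows
) ranked
WHERE rn = 1;
```

This avoids the CASE expression per column, which is convenient when every column should be taken from the increment anyway.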