Sqoop Incremental Import

Posted 2020-05-14 04:15

Need advice on Sqoop Incremental Imports. Say I have a Customer with Policy 1 on Day 1 and I imported those records in HDFS on Day 1 and I see them in Part Files.
On Day 2, the same customer adds Policy 2 and after the incremental import sqoop run, will we get only new records in the part files? In that case, How do I get the Old and Incremental appended/last modified records using Sqoop?

Tags: sqoop
8 answers
三岁会撩人
#2 · 2020-05-14 05:08

In such use cases, always look for a field that is genuinely incremental in nature when using incremental append. For last modified, the best-suited field is modified_date or a similar column that records when a row was changed since you last sqoop-ed it; only those rows will be updated. Adding newer rows to your HDFS location requires incremental append.
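To make the distinction concrete, here is a small Python sketch (hypothetical rows and column names, not Sqoop itself) of which records each incremental mode selects:

```python
# Hypothetical rows: which records each incremental mode would pick up.
rows = [
    {"custid": 101, "policy": 1, "modified_date": "2020-05-13"},
    {"custid": 102, "policy": 2, "modified_date": "2020-05-14"},  # updated since last run
    {"custid": 103, "policy": 3, "modified_date": "2020-05-14"},  # brand-new row
]

# --incremental append: rows whose check-column value exceeds --last-value
appended = [r for r in rows if r["custid"] > 102]

# --incremental lastmodified: rows touched after the last import timestamp
modified = [r for r in rows if r["modified_date"] > "2020-05-13"]

print([r["custid"] for r in appended])   # [103] -- new rows only
print([r["custid"] for r in modified])   # [102, 103] -- new and updated rows
```

Append only sees rows beyond the last key, while lastmodified also catches in-place updates, which is why the two modes answer different questions.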

爷的心禁止访问
#3 · 2020-05-14 05:09

Let's take an example here: you have a customer table with two columns, custid and policy, custid is your primary key, and you want to import data from custid 100 onward.

Scenario 1: append new data on the basis of the custid field

Phase 1:

The 3 records below were inserted recently into the customer table, and we want to import them into HDFS:

| custid | Policy |
| ------ | ------ |
| 101    | 1      |
| 102    | 2      |
| 103    | 3      |

Here is the sqoop command for that:

sqoop import \ 
--connect jdbc:mysql://localhost:3306/db \ 
--username root -P \ 
--table customer \ 
--target-dir /user/hive/warehouse/<your db>/<table> \
--append \
--check-column custid \
--incremental append \
--last-value 100

Phase 2:

The 4 records below were inserted recently into the customer table, and we want to import them into HDFS:

| custid | Policy |
| ------ | ------ |
| 104    | 4      |
| 105    | 5      |
| 106    | 6      |
| 107    | 7      |

Here is the sqoop command for that:

sqoop import \ 
--connect jdbc:mysql://localhost:3306/db \
--username root -P \ 
--table customer \ 
--target-dir /user/hive/warehouse/<your db>/<table> \ 
--append \
--check-column custid \
--incremental append \
--last-value 103

So these are the four properties we have to consider for inserting new records:

--append \
--check-column <primary key> \
--incremental append \
--last-value <last value of the primary key that the previous sqoop run imported>
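The two append runs above can be simulated in a few lines of Python (a sketch of the selection logic only, not of Sqoop itself):

```python
# Simulating two successive --incremental append runs keyed on custid.
source = [{"custid": i, "policy": i - 100} for i in range(101, 108)]

def incremental_append(table, last_value):
    """Select rows whose check-column value exceeds --last-value."""
    return [r for r in table if r["custid"] > last_value]

run1 = incremental_append(source[:3], 100)  # phase 1: only 101-103 exist yet
run2 = incremental_append(source, 103)      # phase 2: 104-107 were added

hdfs = run1 + run2  # each run appends a new set of part files
print([r["custid"] for r in hdfs])  # [101, 102, 103, 104, 105, 106, 107]
```

Note that each run only ever adds files; nothing already in HDFS is rewritten, which is exactly why append alone cannot reflect updates to old rows.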

Scenario 2: append new data + update existing data

One new record with custid 108 has been inserted, and custid 101 and 102 have been updated recently in the customer table; we want to import these changes into HDFS:

| custid | Policy |
| ------ | ------ |
| 108    | 8      |
| 101    | 11     |
| 102    | 12     |

Note that with --incremental lastmodified, the check column must be a date/timestamp column (custid will not work), and folding updates into the files already in HDFS requires --merge-key instead of --append (the two options are mutually exclusive). Assuming the table has a modified_date timestamp column:

sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse/<your db>/<table> \
--merge-key custid \
--check-column modified_date \
--incremental lastmodified \
--last-value "<timestamp of the previous import>"

So these are the four properties we have to consider for insert/update records in the same command:

--merge-key <primary key> \
--check-column <timestamp column, e.g. modified_date> \
--incremental lastmodified \
--last-value <timestamp up to which the previous sqoop run imported>
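A rough Python sketch of the merge semantics that incremental lastmodified aims at: freshly modified rows are pulled and reconciled with the existing data, with the merge key (Sqoop's --merge-key) deciding which copy of a duplicate wins. This is an illustration of the idea, not Sqoop's implementation:

```python
# Rows already sitting in HDFS from the earlier append runs (custid 101-107).
existing = [{"custid": i, "policy": i - 100} for i in range(101, 108)]

# Rows the lastmodified run pulls: one insert and two updates.
changed = [
    {"custid": 108, "policy": 8},   # new row
    {"custid": 101, "policy": 11},  # updated row
    {"custid": 102, "policy": 12},  # updated row
]

# Merge step: key both sets on custid; newer rows replace older ones.
merged = {r["custid"]: r for r in existing}
merged.update({r["custid"]: r for r in changed})

result = sorted(merged.values(), key=lambda r: r["custid"])
print([(r["custid"], r["policy"]) for r in result])
```

After the merge there is exactly one row per custid, which answers the original question: the part files end up holding the old rows plus the appended/last-modified ones, deduplicated on the key.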

I am specifically mentioning the primary key because, if the table has no primary key, a few more properties need to be considered:

Multiple mappers perform the sqoop job by default, and the mappers need the data to be split on the basis of some key. So:

either we specifically pass -m 1 to say that only one mapper will perform the operation,

or we specify some other key through which the data can be uniquely identified, using the sqoop property --split-by.
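As a rough illustration of why the split column matters (this mirrors the idea, not Sqoop's exact algorithm): the split column's min-max range is cut into one contiguous slice per mapper, so the column must order the rows sensibly:

```python
# Divide the split column's [lo, hi] range into one slice per mapper,
# the way Sqoop partitions WHERE clauses across its parallel mappers.
def split_ranges(lo, hi, num_mappers):
    step = (hi - lo + 1) / num_mappers
    bounds = [round(lo + step * i) for i in range(num_mappers)] + [hi + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]

# custid values 101-108 shared among 4 mappers:
print(split_ranges(101, 108, 4))  # [(101, 102), (103, 104), (105, 106), (107, 108)]
```

With -m 1 there is a single slice covering the whole range, which is why a table without a usable key can still be imported by one mapper.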
