Need advice on Sqoop Incremental Imports.
Say I have a customer with Policy 1 on Day 1, and I imported those records into HDFS on Day 1; I can see them in the part files.
On Day 2, the same customer adds Policy 2. After the incremental Sqoop import runs, will we get only the new records in the part files?
In that case, how do I get both the old and the incremental (appended/last-modified) records using Sqoop?
In such use cases, always look for a field that is genuinely incremental in nature (such as an auto-increment ID) for incremental append mode. For last-modified mode, the best-suited check column is modified_date, or some similar timestamp that marks rows changed since you last sqoop-ed them; only those rows will be updated. Adding newer rows to your HDFS location requires incremental append.
Let's take an example: you have a customer table with two columns, cust_id and policy, where cust_id is the primary key, and you want to import data from cust_id 100 onward.
Scenario 1: append new data on the basis of the cust_id field.
Phase 1:
Suppose 3 records were recently inserted into the customer table and we want to import them into HDFS.
Here is the Sqoop command for that:
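A sketch of that first incremental-append run; the JDBC URL, database name, credentials, and target directory are placeholder assumptions, and --last-value 100 means "import rows whose cust_id is greater than 100":

```shell
# Incremental append: pull rows with cust_id > 100 into HDFS.
# Connection details and paths below are placeholders, not real values.
sqoop import \
  --connect jdbc:mysql://localhost:3306/insurance \
  --username sqoop_user -P \
  --table customer \
  --target-dir /user/hive/customer \
  --incremental append \
  --check-column cust_id \
  --last-value 100
```

At the end of the run, Sqoop logs the new --last-value to use for the next incremental import.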
Phase 2:
Suppose 4 more records were recently inserted into the customer table and we want to import them into HDFS.
Here is the Sqoop command for that:
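A sketch of the follow-up run; it is identical to the first except for --last-value, which should be the highest cust_id already imported (103 here is an assumption based on the 3 records from phase 1):

```shell
# Second incremental append: pull only rows added since the last run.
# --last-value 103 assumes phase 1 imported cust_id 101-103.
sqoop import \
  --connect jdbc:mysql://localhost:3306/insurance \
  --username sqoop_user -P \
  --table customer \
  --target-dir /user/hive/customer \
  --incremental append \
  --check-column cust_id \
  --last-value 103
```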
So the key properties we have to consider for inserting new records are --incremental append, --check-column, and --last-value.
Scenario 2: append new data + update existing data on the basis of the cust_id field.
Suppose 1 new record with cust_id 108 was inserted, and the records for cust_id 101 and 102 were recently updated in the customer table; we want to import all of these into HDFS.
So the four properties we have to consider for insert/update in the same command are --incremental lastmodified, --check-column, --last-value, and --merge-key.
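A sketch of such an insert/update run, assuming the table has a modified_date timestamp column; the JDBC URL, credentials, paths, and the example timestamp are placeholders:

```shell
# Incremental lastmodified: pick up rows inserted OR updated since the
# given timestamp, then merge them with the existing HDFS data on cust_id.
# Connection details, paths, and the timestamp are placeholder values.
sqoop import \
  --connect jdbc:mysql://localhost:3306/insurance \
  --username sqoop_user -P \
  --table customer \
  --target-dir /user/hive/customer \
  --incremental lastmodified \
  --check-column modified_date \
  --last-value "2017-01-02 00:00:00" \
  --merge-key cust_id
```

With --merge-key, Sqoop runs a merge job after the import that reconciles updated rows with the rows already in HDFS, so the output contains one consolidated copy of each cust_id rather than duplicates.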
I specifically mention the primary key because, if the table does not have one, a few more properties need to be considered:
By default, Sqoop runs the job with multiple mappers, and the mappers need the data to be split on the basis of some key. So
either we have to explicitly pass the -m 1 option to say that only one mapper will perform the operation,
or we have to specify some other key (using the --split-by property) through which the rows can be uniquely identified.
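A sketch of the two options for a table without a primary key; as before, the connection details and paths are placeholders:

```shell
# Option 1: force a single mapper, so no split column is needed.
sqoop import \
  --connect jdbc:mysql://localhost:3306/insurance \
  --username sqoop_user -P \
  --table customer \
  --target-dir /user/hive/customer_single \
  -m 1

# Option 2: name a column Sqoop can use to split the work across mappers.
sqoop import \
  --connect jdbc:mysql://localhost:3306/insurance \
  --username sqoop_user -P \
  --table customer \
  --target-dir /user/hive/customer_split \
  --split-by cust_id
```

Option 1 is simplest but serializes the import; option 2 keeps parallelism, and works best when the split column is evenly distributed.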