Need advice on Sqoop Incremental Imports.
Say I have a customer with Policy 1 on Day 1. I imported those records into HDFS on Day 1 and I can see them in the part files.
On Day 2, the same customer adds Policy 2. After the incremental Sqoop import runs, will we get only the new records in the part files?
In that case, how do I get both the old and the incremental (appended/last-modified) records using Sqoop?
There are already great answers here. Along with these, you could also try the Sqoop free-form query approach: you can customize the query condition so that only the new or updated records are retrieved.
STEP 1: Importing New Records from the Database Table:
Example 1:
$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
    --split-by a.id \
    --target-dir /tmp/MyNewloc
Example 2:
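A minimal sketch of such a conditional query import; the connection string, table (customer_policy), and timestamp column (updated_at) are hypothetical:

$ sqoop import \
    --connect jdbc:mysql://localhost/policydb \
    --username sqoopuser -P \
    --query 'SELECT * FROM customer_policy WHERE updated_at > "2019-01-02" AND $CONDITIONS' \
    --split-by customer_id \
    --target-dir /tmp/MyNewloc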
Don't forget to supply $CONDITIONS in the WHERE clause.
Please refer to Sqoop Free-Form Query Imports for details.
STEP 2: Merging the part-m files of both the base table (original data) and the new table (new records)
You could do this using two methods.
Method 1 - Using sqoop merge (see the sketch below)
Method 2 - Copying the newly generated part-m files into the original table's target directory (copy the part-m files from /tmp/MyNewloc to /tmp/MyOriginalLoc/)
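For Method 1, a minimal sketch of sqoop merge using the directories above; the merge key, jar, and class names are assumptions (the jar and class would come from running sqoop codegen against your table):

$ sqoop merge \
    --new-data /tmp/MyNewloc \
    --onto /tmp/MyOriginalLoc \
    --target-dir /tmp/MyMergedLoc \
    --merge-key id \
    --jar-file customer_policy.jar \
    --class-name customer_policy

Rows in --new-data whose merge key matches an existing row win over the old version; all other rows are carried through unchanged.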
STEP 3: Creating the Hive Table
Now create a Hive table using the original table's target directory as its location, since it contains both the original part-m files and the new-records part-m files.
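A sketch of that DDL, assuming hypothetical column names and Sqoop's default comma field delimiter:

$ hive -e "CREATE EXTERNAL TABLE customer_policy (
             customer_id INT,
             policy_id   INT,
             policy_date STRING)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
           LOCATION '/tmp/MyOriginalLoc/'"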
You could also try a free-form query that is altered based on a specific condition. And you could write Java code using the Sqoop client to do the same; see: How to use Sqoop in Java Program?
In answer to your first question, it depends on how you run the import statement. If you use the --incremental append option, you specify your --check-column and --last-value arguments. These dictate exactly which records are pulled, and they are simply appended to your table.
For example: you could specify a DATE-type column for your --check-column argument and a very early date (like '1900-01-01', or Day 1 in your case) for --last-value, and this would keep appending everything in the source table (creating duplicate rows) to your destination. In this case, the new part files created will hold both new and old records. You could also use an increasing ID column and keep supplying a small ID, which would have the same effect. However, if --last-value is Day 2, there will be additional part files with only the new records. I'm not sure if you were wondering whether you would lose the old records (just in case you were), but that's not the case.
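A sketch of the Day 2 run, assuming a hypothetical connection string, table, and DATE column named policy_date:

$ sqoop import \
    --connect jdbc:mysql://localhost/policydb \
    --username sqoopuser -P \
    --table customer_policy \
    --target-dir /user/hive/warehouse/customer_policy \
    --incremental append \
    --check-column policy_date \
    --last-value '2019-01-01'

Sqoop prints the new --last-value at the end of each run; wrapping this in a saved job (sqoop job --create ...) lets Sqoop track that value for you between runs.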
The lastmodified mode for --incremental would only be useful if, in the future, you go back and update some of the attributes of an existing row. In that case, it replaces the old data in your table (and adds the new stuff) with the updated version of the row that's now in your source table. Hope this helps!
All of this is based on the Sqoop User Guide, Section 7.2.7 (https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports), and Chapter 3 of the Apache Sqoop Cookbook (that chapter is actually fantastic!).
Here's a step-by-step guide for Sqoop incremental imports.
As an overview: use append mode only when the rows in your source table never update (or you don't care about the updates); use lastmodified when you want to pick up updates to already-imported rows as well.
Step 1: The entire table is imported. This will be available as part-m files in your specified HDFS location (say /user/abc/def/part-m-00000).
Step 2: Only the incremental records are imported. These will be available in another location (say /user/abc/def1/part-m-00000).
Now that both sets of data are available, you can use the sqoop merge option to consolidate them based on the key column.
Refer to the doc below for more details:
https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_literal_sqoop_merge_literal
Consider a table with 3 records which you have already imported to HDFS using Sqoop.
Now you have additional records in the table but no updates to existing records.
Here you should use --incremental append with --check-column, which specifies the column to be examined when determining which rows to import. A command like the one below will insert all the new rows based on the last value.
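A sketch of that command, assuming a hypothetical connection string and table, with an auto-increment id column and the 3 existing rows already imported:

$ sqoop import \
    --connect jdbc:mysql://localhost/policydb \
    --username sqoopuser -P \
    --table customer_policy \
    --target-dir /user/abc/def \
    --incremental append \
    --check-column id \
    --last-value 3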
Now consider the second case, where there are updates to existing rows.
Here we use --incremental lastmodified, which fetches all rows inserted or updated since the given date, based on a timestamp column.
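A sketch, assuming a hypothetical timestamp column last_update_date; adding --merge-key makes Sqoop reconcile the updates with the previously imported rows, so each key ends up with a single, current copy:

$ sqoop import \
    --connect jdbc:mysql://localhost/policydb \
    --username sqoopuser -P \
    --table customer_policy \
    --target-dir /user/abc/def \
    --incremental lastmodified \
    --check-column last_update_date \
    --last-value '2019-01-02 00:00:00' \
    --merge-key id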