My understanding is that when we run an incremental import without a merge key, Sqoop creates a new file containing only the appended rows, whereas with --merge-key it should produce the whole data set, previous and new rows combined, in a single file. However, I am not getting one part file when I use incremental append in my Sqoop job. Below are my steps:
1) Initial data:
mysql> select * from departments_per;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan Shop        |
+---------------+-----------------+
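For anyone who wants to reproduce this, the table can be recreated roughly as below (the column types are my assumption, since I have not shown the original DDL):

mysql -u root -pcloudera practice <<'SQL'
-- column types are assumed; adjust to match the real table
CREATE TABLE departments_per (
  department_id   INT PRIMARY KEY,
  department_name VARCHAR(45)
);
INSERT INTO departments_per VALUES
  (2,'Fitness'),(3,'Footwear'),(4,'Apparel'),
  (5,'Golf'),(6,'Outdoors'),(7,'Fan Shop');
SQL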
2) Sqoop command to import the data into HDFS initially:
sqoop import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_per \
--target-dir /departments \
-m 1
Now when I list the departments directory in HDFS I can see one part file, which is fine.
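For reference, the check is just the usual listing and cat of the single part file (output omitted here):

hadoop fs -ls /departments
hadoop fs -cat /departments/part-m-00000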
3) Now I update the initial data in MySQL:
mysql> select * from departments_demo;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan             |
|             8 | Tushar MC       |
+---------------+-----------------+
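The exact statements are not important; the change amounts to something like this (my reconstruction, assuming departments_demo holds the same rows as the initial table):

mysql -u root -pcloudera practice <<'SQL'
-- row 7 changes from 'Fan Shop' to 'Fan', row 8 is new
UPDATE departments_demo SET department_name = 'Fan' WHERE department_id = 7;
INSERT INTO departments_demo VALUES (8, 'Tushar MC');
SQL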
4) Now I create an incremental append job and execute it:
sqoop job --create appendJobs12 \
-- import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_demo \
-m 1 \
--target-dir /departments \
--incremental append \
--merge-key department_id \
--check-column department_id \
--last-value 0
sqoop job --exec appendJobs12
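For completeness, the saved job and its stored parameters can be inspected like this (output omitted):

sqoop job --list
sqoop job --show appendJobs12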
5) I can see two part files in the HDFS directory even though I used the merge-key option.
[cloudera@quickstart ~]$ hadoop fs -ls /departments
Found 3 items
-rw-r--r-- 1 cloudera supergroup 0 2018-10-04 00:31 /departments/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 60 2018-10-04 00:31 /departments/part-m-00000
-rw-r--r-- 1 cloudera cloudera 67 2018-10-04 00:37 /departments/part-m-00001
When I display the data it looks like this:
[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00000
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00001
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan
8, Tushar MC
One part file holds the initial data and the second one holds the updated data. Can anybody tell me where I am going wrong, and why I am not getting a single part file with the merged, up-to-date dataset? Thanks in advance.
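To make the expectation concrete, the single merged part file I am hoping for would contain the union of both row sets, with the updated value for department_id 7:

2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan
8,Tushar MC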