I want to update the existing data if it exists based on some condition(data with higher priority should be updated) and insert new data if not exists.
I have already written a query for this but somehow it is duplicating the number of rows. Here is the full explanation of what I have and what I want to achieve:
What I have: Table 1 - columns - id,info,priority
hive> select * from sample1;
OK
1 123 1.01
2 234 1.02
3 213 1.03
5 213423 1.32
Time taken: 1.217 seconds, Fetched: 4 row(s)
Table 2: columns - id,info,priority
hive> select * from sample2;
OK
1 1234 1.05
2 23412 1.01
3 21 1.05
4 1232 1.1
2 3432423 1.6
3 34324 1.4
What I want is the final table should have only 1 row per id with the data according to the greatest priority:
1 1234 1.05
2 3432423 1.6
3 34324 1.4
4 1232 1.1
5 213423 1.32
The query that I have written is this:
insert overwrite table sample1
select a.id,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from sample1 a
join
sample2 b
on a.id=b.id where b.id in (select distinct(id) from sample1)
union all
select * from sample2 where id not in (select distinct(id) from sample1)
union all
select * from sample1 where id not in (select distinct(id) from sample2);
After running this query, I am getting this result:
hive> select * from sample1;
OK
1 1234 1.05
2 234 1.02
3 21 1.05
2 3432423 1.6
3 34324 1.4
5 213423 1.32
4 1232 1.1
How do I modify the present query to achieve the correct result. Is there any other method/process that I can follow to achieve the end result. I am using hadoop 2.5.2 along with HIVE 1.2.1 . I am working on a 6 node cluster with 5 slaves and 1 NN.
Use
FULL JOIN
, it will return all joined rows plus all not joined rows from the left and all not joined rows from the right tables.sample2
table contains duplicated rows perid
, this is why join duplicates rows, userow_number()
analytic function to select only rows with highest priority fromsample2
table:If
sample1
table also contains multiple rows perid
(it is not in your example), apply the same technique using row_number to the table sample1.See also the answer about merge using
full join
: https://stackoverflow.com/a/37744071/2700344Also as of Hive 2.2 you can use ACID Merge, see examples
Since I had multiple id rows for each id, so I firstly consolidated the IDs using a spark script. The solution could be found here : SPARK 2.2.2 - Joining multiple RDDs giving out of memory excepton. Resulting RDD has 124 columns. What should be the optimal joining method? Then I used the query mentioned in the question to get the desired result.
adding to previously good answers! try this also: