select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as
timestamps from testingtable2 LATERAL VIEW explode(purchased_item) exploded_table
as prod_and_ts;
By using the above query, I am getting the below output.
USER_ID | PRODUCT_ID | TIMESTAMPS
------------+------------------+-------------
1015826235 220003038067 1004841621
1015826235 300003861266 1005268799
1015826235 140002997245 1061569397
1015826235 *200002448035* 1005542471
If you compare the above output from the query with the below Table2 data
, then the product_id
in the last line of above output
is not matching with the ITEM_ID
in the last line in the below Table2
data.
BUYER_ID | ITEM_ID | CREATED_TIME
-------------+-------------------+------------------------
1015826235 220003038067 2001-11-03 19:40:21
1015826235 300003861266 2001-11-08 18:19:59
1015826235 140002997245 2003-08-22 09:23:17
1015826235 *210002448035* 2001-11-11 22:21:11
So my question is
Find all those PRODUCT_ID(ITEM_ID)
and TIMESTAMPS(CREATED_TIME)
that are not matching with Table2
data corresponding to particular BUYER_ID or USER_ID.
So I need to show the result like this for the above example-
BUYER_ID | ITEM_ID | CREATED_TIME | USER_ID | PRODUCT_ID | TIMESTAMPS
-----------+-------------------+-------------------------+---------------+------------------+------------------
1015826235 *210002448035* 2001-11-11 22:21:11 1015826235 *200002448035* 1005542471
I need to JOIN the above query that I wrote with table2 to get the above result. So I need to use my above query in the JOINING process. That is confusing me a lot. Any suggestion will be appreciated.
UPDATE:-
I wrote the below query, but somehow I am not able to achieve the output that I wanted to achieve. Can anyone help me with this?
SELECT table2.buyer_id, table2.item_id, table2.created_time from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as
timestamps from testingtable2 LATERAL VIEW explode(purchased_item) exploded_table
as prod_and_ts) prod_and_ts JOIN table2 where
prod_and_ts.user_id = table2.buyer_id
and (product_id <> table2.item_id or
timestamps <> UNIX_TIMESTAMP(table2.created_time));
I think you can do what you want with two queries, but I'm not 100% sure. Often in this situation, it is sufficient to find things in the first table that don't match in the second table. You are also trying to get a "closest" match, which is why this is challenging.
The following query looks for matches on user id and exactly one of the other two fields, and then combines them:
SELECT table2.buyer_id, table2.item_id, table2.created_time, prod_and_ts.*
from (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
from testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts
) prod_and_ts JOIN
table2
on prod_and_ts.user_id = table2.buyer_id and
prod_and_ts.product_id = table2.item_id and
prod_and_ts.timestamps <> UNIX_TIMESTAMP(table2.created_time)
union all
SELECT table2.buyer_id, table2.item_id, table2.created_time, prod_and_ts.*
from (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
from testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts
) prod_and_ts JOIN
table2
on prod_and_ts.user_id = table2.buyer_id and
prod_and_ts.product_id <> table2.item_id and
prod_and_ts.timestamps = UNIX_TIMESTAMP(table2.created_time)
This will not find situations where there is no match on either field.
Also, I've written this using the "on" syntax rather than "where". I assume HIVE supports this.
Your rep is too high to open a duplicate and especially 2 duplicates of the same question.
Joining two Tables in Hive using HiveQL(Hadoop)
Join Two Tables and get the output from both of them
You don't have enough info to tie the records back for the third scenario.
You can do a FULL OUTER JOIN
with an OR
and get everything back, match the rows that you have enough info on as in the first and second case you list, and identify recs that you don't by returning rows with nulls for the fields from the non matching table in the third scenario.
SELECT DATEPART(d,B.T1time),DATEPART(d,A.Created_TIME),*
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID]
AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
Trying to match on the third scenario is a hack - the info is not there
This will match them with any for the date specified that aren't matching on the other days, but again you will get Cartesian products.
SELECT DATEPART(d,B.T1time),DATEPART(d,A.Created_TIME),*
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID]
AND (
(B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
OR
(
(A.Created_TIME <> B.t1time AND B.PRODUCTID <> A.ITEM_ID AND DATEPART(d,B.T1time) = DATEPART(d,A.Created_TIME))
AND a.ITEM_ID NOT IN(SELECT ITEM_ID
FROM SO_Table1HIVE A2
INNER JOIN SO_Table2HIVE B2 ON A2.BUYER_ID = B2.[USER_ID] AND (A2.Created_TIME = B2.t1time OR B2.PRODUCTID = A2.ITEM_ID)
)
AND B.PRODUCTID NOT IN(SELECT PRODUCTID
FROM SO_Table1HIVE A2
INNER JOIN SO_Table2HIVE B2 ON A2.BUYER_ID = B2.[USER_ID] AND (A2.Created_TIME = B2.t1time OR B2.PRODUCTID = A2.ITEM_ID)
)
)
)
You could use RANK()
or try a top one
, etc. RANK()
or ROW_NO
would probably be the best of these hacks if this were not a Hive question, but as I know your are using HQL, I am not going to write it up. You could pull them out in a separate table and run some logical update queries to update it, then use that as a lookup table to tie back.
tbl1Tbl2Lookup
---------------
id int identity
table1info FK
table2info FK
What you should probably do is what the person in the question you offered a bounty on suggested - since you really don't have a good way to query the third scenario and they offered you an alternative that is specific to HIVE.