SQL Query JOIN with Table

2020-01-29 02:14发布

问题:

select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as 
timestamps from testingtable2 LATERAL VIEW explode(purchased_item) exploded_table
as prod_and_ts;

By using the above query, I am getting the below output.

USER_ID     |    PRODUCT_ID    |   TIMESTAMPS
------------+------------------+-------------
1015826235       220003038067      1004841621
1015826235       300003861266      1005268799
1015826235       140002997245      1061569397
1015826235      *200002448035*     1005542471

If you compare the above output from the query with the below Table2 data, then the product_id in the last line of above output is not matching with the ITEM_ID in the last line in the below Table2 data.

BUYER_ID     |    ITEM_ID        |     CREATED_TIME 
-------------+-------------------+------------------------
1015826235       220003038067        2001-11-03 19:40:21
1015826235       300003861266        2001-11-08 18:19:59
1015826235       140002997245        2003-08-22 09:23:17
1015826235      *210002448035*       2001-11-11 22:21:11

So my question is

Find all those PRODUCT_ID(ITEM_ID) and TIMESTAMPS(CREATED_TIME) that are not matching with Table2 data corresponding to particular BUYER_ID or USER_ID.

So I need to show the result like this for the above example-

BUYER_ID   |     ITEM_ID       |      CREATED_TIME       |     USER_ID   |       PRODUCT_ID     |   TIMESTAMPS
-----------+-------------------+-------------------------+---------------+------------------+------------------
1015826235     *210002448035*       2001-11-11 22:21:11     1015826235      *200002448035*     1005542471

I need to JOIN the above query that I wrote with table2 to get the above result. So I need to use my above query in the JOINING process. That is confusing me a lot. Any suggestion will be appreciated.

UPDATE:-

I wrote the below query, but somehow I am not able to achieve the output that I wanted to achieve. Can anyone help me with this?

SELECT table2.buyer_id, table2.item_id, table2.created_time from 
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as 
timestamps from testingtable2 LATERAL VIEW explode(purchased_item) exploded_table
as prod_and_ts) prod_and_ts JOIN table2 where 
prod_and_ts.user_id = table2.buyer_id
and (product_id <> table2.item_id or 
timestamps <> UNIX_TIMESTAMP(table2.created_time));

回答1:

I think you can do what you want with two queries, but I'm not 100% sure. Often in this situation, it is sufficient to find things in the first table that don't match in the second table. You are also trying to get a "closest" match, which is why this is challenging.

The following query looks for matches on user id and exactly one of the other two fields, and then combines them:

SELECT table2.buyer_id, table2.item_id, table2.created_time, prod_and_ts.*
from (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
      from testingtable2 LATERAL VIEW
           explode(purchased_item) exploded_table as prod_and_ts
     ) prod_and_ts JOIN
     table2
     on prod_and_ts.user_id = table2.buyer_id and
        prod_and_ts.product_id = table2.item_id and
        prod_and_ts.timestamps <> UNIX_TIMESTAMP(table2.created_time)
union all
SELECT table2.buyer_id, table2.item_id, table2.created_time, prod_and_ts.*
from (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
      from testingtable2 LATERAL VIEW
           explode(purchased_item) exploded_table as prod_and_ts
     ) prod_and_ts JOIN
     table2
     on prod_and_ts.user_id = table2.buyer_id and
        prod_and_ts.product_id <> table2.item_id and
        prod_and_ts.timestamps = UNIX_TIMESTAMP(table2.created_time)

This will not find situations where there is no match on either field.

Also, I've written this using the "on" syntax rather than "where". I assume HIVE supports this.



回答2:

Your rep is too high to open a duplicate and especially 2 duplicates of the same question.

Joining two Tables in Hive using HiveQL(Hadoop)

Join Two Tables and get the output from both of them

You don't have enough info to tie the records back for the third scenario.
You can do a FULL OUTER JOIN with an OR and get everything back, match the rows that you have enough info on as in the first and second case you list, and identify recs that you don't by returning rows with nulls for the fields from the non matching table in the third scenario.

SELECT DATEPART(d,B.T1time),DATEPART(d,A.Created_TIME),* 
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] 
AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)

Trying to match on the third scenario is a hack - the info is not there

This will match them with any for the date specified that aren't matching on the other days, but again you will get Cartesian products.

SELECT DATEPART(d,B.T1time),DATEPART(d,A.Created_TIME),* 
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] 
AND (
    (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
    OR
    (
        (A.Created_TIME <>  B.t1time AND B.PRODUCTID <> A.ITEM_ID AND DATEPART(d,B.T1time) = DATEPART(d,A.Created_TIME)) 
        AND a.ITEM_ID NOT IN(SELECT ITEM_ID
                   FROM SO_Table1HIVE A2
                   INNER JOIN SO_Table2HIVE B2 ON A2.BUYER_ID = B2.[USER_ID] AND (A2.Created_TIME =  B2.t1time OR B2.PRODUCTID = A2.ITEM_ID)
                   )

        AND B.PRODUCTID NOT IN(SELECT PRODUCTID
                   FROM SO_Table1HIVE A2
                   INNER JOIN SO_Table2HIVE B2 ON A2.BUYER_ID = B2.[USER_ID] AND (A2.Created_TIME =  B2.t1time OR B2.PRODUCTID = A2.ITEM_ID)
                   )
    )

)

You could use RANK() or try a top one, etc. RANK() or ROW_NO would probably be the best of these hacks if this were not a Hive question, but as I know your are using HQL, I am not going to write it up. You could pull them out in a separate table and run some logical update queries to update it, then use that as a lookup table to tie back.

 tbl1Tbl2Lookup
 ---------------
 id int identity
 table1info FK
 table2info FK 

What you should probably do is what the person in the question you offered a bounty on suggested - since you really don't have a good way to query the third scenario and they offered you an alternative that is specific to HIVE.



标签: sql join hive