Possible Duplicate:
SQL Query JOIN with Table
CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable1 (This is the MAIN table through which comparisons need to be made)
(
BUYER_ID BIGINT,
ITEM_ID BIGINT,
CREATED_TIME STRING
)
And this is the data in the above first table
**BUYER_ID** | **ITEM_ID** | **CREATED_TIME**
--------------+------------------+-------------------------
1015826235 220003038067 *2001-11-03 19:40:21*
1015826235 300003861266 2001-11-08 18:19:59
1015826235 140002997245 2003-08-22 09:23:17
1015826235 *210002448035* 2001-11-11 22:21:11
This is Second table in Hive- It also contains information about the items we are purchasing.
CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)
And this is the data in the above second table (TestingTable2
)-
**USER_ID** **PURCHASED_ITEM**
1015826235 [{"product_id":220003038067,"timestamps":"1004941621"}, {"product_id":300003861266,"timestamps":"1005268799"}, {"product_id":140002997245,"timestamps":"1061569397"},{"product_id":200002448035,"timestamps":"1005542471"}]
Compare TestingTable2
with TestingTable1
so that below scenario is fulfilled.
Find the PRODUCT_ID
AND TIMESTAMPS
from TestingTable2
WHICH IS NOT MATCHING WITH ITEM_ID
AND CREATED_TIME
from TestingTable1 CORRESPONDING TO BUYER_ID(USER_ID)
after comparing from TestingTable1
.
So If you look TestingTable2
data this(last) ITEM_ID 210002448035
from TestingTable1
is not matching with TestingTable2
PRODUCT_ID- 200002448035
data and similarly with timestamps. So I want to show the below result using the HiveQL query.
**BUYER_ID** | **ITEM_ID** | **CREATED_TIME** | **PRODUCT_ID** | **TIMESTAMPS**
--------------+------------------+--------------------------------+------------------------+----------------------
1015826235 *210002448035* 2001-11-11 22:21:11 200002448035 1005542471
1015826235 220003038067 *2001-11-03 19:40:21* 220003038067 1004941621
Can anyone help me with this. As I am new to HiveQL so having lot of Problem.
Updated:-
I have written this query, but it doesn't working the way I wanted to.
select * from
(select * from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
from testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts)
prod_and_ts
LEFT OUTER JOIN testingtable1
ON ( prod_and_ts.user_id = testingtable1.buyer_id AND testingtable1.item_id = prod_and_ts.product_id
AND prod_and_ts.timestamps = UNIX_TIMESTAMP (testingtable1.created_time)
)
where testingtable1.buyer_id IS NULL)
set_a LEFT OUTER JOIN testingtable1
ON (set_a.user_id = testingtable1.buyer_id AND
( set_a.product_id = testingtable1.item_id OR set_a.timestamps = UNIX_TIMESTAMP(testingtable1.created_time) )
);
One More UPDATE
As per user1166147
comments. I wrote my query as per his query.
In hive I guess INNER JOIN
are written by simply JOIN
.
This is my below query.
select * from (select t2.buyer_id, t2.item_id, t2.created_time as created_time, subq.user_id, subq.product_id, subq.timestamps as timestamps
from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq JOIN testingtable1 t2 on t2.buyer_id = subq.user_id
AND subq.timestamps = unix_timestamp(t2.created_time)
WHERE (subq.product_id <> t2.item_id)
union all
select t2.buyer_id, t2.item_id as item_id, t2.created_time, subq.user_id, subq.product_id as product_id, subq.timestamps
from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq JOIN testingtable1 t2 on t2.buyer_id = subq.user_id
and subq.product_id = t2.item_id
WHERE (subq.timestamps <> unix_timestamp(t2.created_time))) unionall;
And after running the above query, I am getting zero result back.
One Final UPDATE:-
My Bad, I didn't have the accurate data in the tables so that is the reason I was not getting the result back. Yes it is working the actual above query.
EDIT - PART 1 Okay - For some reason I am going to explain myself - so to start with I stumbled upon this question because of the SQL tag, and saw Hive, and started to not look and just skip it. BUT then I noticed it had been over a day and you had gotten no answers. I looked - I saw a SQL logic correction in the original query posted that I knew would be needed and would help, so I posted ONLY because no one had answered. I will try to address this last question - but after that I am keeping my advice to myself, as I may be giving bad advice. Good luck! I tried! And you seem to be getting answers now, so...
In TSQL, I could solve this entire problem with the below single query:
It would return everything, including your match buyer_id/user_id only. It won't match a buyer_id/user_id row with no matches in either time or product in the other table, but it will return it as a separate row with NULLS in the other table's fields. I would not match these any way - there is no accurate information provided to do it with as explained below.
END EDIT PART 1
If you can't do
FULL OUTER JOIN
withOR
in Hive, the simplest way to meet the original criteria is toUNION ALL
2INNER JOIN
s. On one of the queries, in addition to joining the matching user_ids, join on the PRODUCT_ID AND in yourWHERE
look for TIMESTAMPS that don't match CREATED_TIME. On the second query, in addition to joining the matching user_ids, join on the times AND in yourWHERE
look for products that don't match.EDIT PART 2 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA
If I understand the last criteria it is any record in either table that has a matching user_id = buyer_id, but nothing else matches. The
FULL OUTER JOIN
withOR
condition will return them, but there isn't enough provided info for a way to relate the records to each other. We can easily identify them, but have no way to tie them back to each other. If you do so and you have more than one record without a match in either OR both tables, there are going to be multiple entries for each.Any query I wrote to try to tie them without more info (and probably with) would be a guess and inaccurate.
For example, in the first table if there were these 2 (sample fake) records with nothing matching in the second except user_id:
AND in table2 - these non matching:
You can identify them, but if you match them without more criteria you get 4 instead of 2:
My suggestion would be simply to identify them and show them, as below.
END EDIT PART 2 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA - PART 1
I am working with TSQL, so I can't test for you an exact query with your syntax, but the concepts of the joins are the same, and this will return what you want. I did take your query and attempt your syntax, modify as needed. I tested in TSQL. You may be able to take this and improve upon it with functionality in HiveQL. There are other ways to do this - but this is the most straightforward and this will translate to HiveQL.
REMOVED, YOU GOT THIS PART AND IT IS INCLUDED LATER
(Again modify syntax as needed)**
And here is my tested TSQL version with my table names for reference:
*EDIT PART 3 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA -PART 2
In TSQL the entire query (no unions) can be run using a
FULL OUTER JOIN
with anOR
condition on the joinIf you can't simply do the above, For the SQL logic for the new criteria - to grab those that don't match from both tables and display them as
NULL
in the other table useRIGHT JOIN
andLEFT JOIN
.RIGHT JOIN
will grab anything in the first table the matches the second and everything in the second, andLEFT
does the opposite. Add the new queries to yourUNION
.TSQL EXAMPLE - MODIFY FOR HIVE
Or, If you wanted to grab them and match them as duplicates add to
UNION
:TSQL
Again, Good luck!
You probably need to use Hive transform functionality and have a custom reducer that does the matching between the records from the two tables: t1 and t2 where t1 is simply TestingTable1 and t2 is
as explained by me in another question of yours.
The above query has 2 distinct portions. The first part is "MAP" and the other is "REDUCE". In between these 2 parts is a phase called shuffle (represented by
CLUSTER BY buyer_id
) that is automatically taken care of my Hive. The Map part of the query reads from tables and also passes an identifier (called id that represents which tables the record is coming from). The Shuffle phase groups all the records per buyer_id. The Reduce phase will take in the all records for a given buyer_id and emit out only the records that satisfy the matching criteria. You will have to write the reducer yourself based on your matching criteria. You can write it in any language of your choice. It's guaranteed that all records that have the same buyer_id will go to the same reducer script.There might be an easier way to do but this is the method I can think of right now. Good luck! To gain further appreciation of why I chose this method, see my recent answer here.