Below is the data in TestingTable1 sorted by date in descending order always
BUYER_ID | ITEM_ID | CREATED_TIME
----------+-----------------+----------------------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
And if this is the below data in TestingTable2 sorted by date in descending order always
USER_ID | PRODUCT_ID | LAST_TIME
---------+----------------+-----------------------
1345653 110909316904 2012-07-09 22:29:06
1345653 150851771618 2012-07-09 19:57:33
Each row in TestingTable1
should match with TestingTable2
, if doesn't match or data is not there in TestingTable2, then I need to show them in the output as in TestingTable1
I have this data but corresponding to TestingTable2
I have this data(which will be wrong data), so that I can see what is mismatch and what data is missing.
I need to compare TestingTable2
with TestingTable1
on BUYER_ID
and USER_ID
. I need to see, if BUYER_ID
and USER_ID
gets matched then I need to compare ITEM_ID
with PRODUCT_ID
and CREATED_TIME
with LAST_TIME
and if there is a mismatch in TestingTable2
after comparing with TestingTable1
in either one of them or both of them, then I need to show the result.
So if you look at the above example- I have three scenarios basically
- Firstly- In
TestingTable1
, in the First rowITEM_ID
is matching withPRODUCT_ID
in the First row ofTestingTable2
butCREATED_TIME
is not matching withLAST_TIME
for the first row in both the tables - Secondly- In
TestingTable1
, in the Second rowCREATED_TIME
is matching withLAST_TIME
in the second row ofTestingTable2
butITEM_ID
is not matching withPRODUCT_ID
for the second row in both the tables - Thirdly- In
TestingTable1
, last two lines(rows) are not there at all inTestingTable2
. This scenario is not covered in my query that I wrote. I want this scenario also in my query.
So these are the three cases that I need to cover while comparing TestingTable2
with TestingTable1
always. And TestingTable1
is the MAIN table through which comparisons need to be made always, so it means data in TestingTable1
is always accurate.
So I need to show the result like below considering the above example if not matching either one of them or data is not there in TestingTable2
at all- TestingTable1
data then next to it same TestingTable2
data, so that I can see what value was there in TestingTable1
as compared to TestingTable2
BUYER_ID | ITEM_ID | CREATED_TIME | USER_ID | PRODUCT_ID | LAST_TIME
-----------+-----------------+---------------------------+----------------+--------------------+-----------------------
1345653 110909316904 2012-07-09 21:29:06 1345653 110909316904 2012-07-09 22:29:06
1345653 151851771618 2012-07-09 19:57:33 1345653 150851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48 NULL NULL NULL
1345653 400307563710 2012-07-09 18:57:33 NULL NULL NULL
Below is the query I wrote that covers only my two scenarios
that I mentioned above and it works fine and I will get the output like above leaving last two rows from the Output. But I need to add third scenario
also in this(below) query so that it gives output like above.
SELECT *
FROM (SELECT T2.buyer_id,
T2.item_id,
T2.created_time AS created_time,
subq.user_id,
subq.product_id,
subq.LAST_TIME
FROM TestingTable2 subq
JOIN TestingTable1 T2
ON T2.buyer_id = subq.user_id
AND subq.LAST_TIME = ( T2.created_time )
WHERE ( subq.product_id <> T2.item_id )
UNION ALL
SELECT T2.buyer_id,
T2.item_id AS item_id,
T2.created_time,
subq.user_id,
subq.product_id AS product_id,
subq.LAST_TIME
FROM TestingTable2 subq
JOIN TestingTable1 T2
ON T2.buyer_id = subq.user_id
AND subq.product_id = T2.item_id
WHERE ( subq.LAST_TIME <> ( T2.created_time ) )) finalResult
ORDER BY finalResult.BUYER_ID;
Any suggestion will be appreciated.
P.S. I have asked few questions related to JOIN in past few days, but that covers only my two scenarios, not the third scenario that I need in this query.
Update:-
I cannot use NOT IN
or NOT EXISTS
syntax for SQL, as I am working with Hive and Hive doesn't support NOT IN
or NOT EXISTS
that so I need some other way to deal with this.
I need to use my query only to modify it to work for Third Scenario, as that way Hive will support SQL syntax.
Below is my SQL Fiddle, that fulfills my two scenario above but not the third scenario. Can anyone help me to modify my query to work for third scenario as well?
http://sqlfiddle.com/#!3/102dd/1/0.
Data in Table1 should be there in Table2, if it is not there, then I need to show the mismatch between data after comparing from Table1 and also it is possible, data from Table1 will not be there in Table2, and I want to show that too.
Updated Output for the fiddle http://sqlfiddle.com/#!3/102dd/3/0
BUYER_ID | ITEM_ID | CREATED_TIME | USER_ID | PRODUCT_ID | LAST_TIME
-----------+-----------------+---------------------------+----------------+--------------------+-----------------------
1345653 151851771618 July, 09 2012 19:57:33 1345653 150851771618 July, 09 2012 19:57:33
1345653 221065796761 July, 09 2012 19:31:48 1345653 221165796761 July, 09 2012 19:31:48
1345653 110909316904 July, 09 2012 21:29:06 1345653 110909316904 July, 09 2012 22:29:06
1345653 400307563710 July, 09 2012 18:57:33 NULL NULL NULL
1345653 310411560125 July, 09 2012 16:09:49 NULL NULL NULL
UPDATED SQL QUERY THAT IS GIVING ME ERROR
I replaced TestingTable1
with this query-
(SELECT BUYER_ID, ITEM_ID, rank(BUYER_ID), CREATED_TIME
FROM (
SELECT BUYER_ID, ITEM_ID, CREATED_TIME
FROM testingtable1
where to_date(from_unixtime(cast(UNIX_TIMESTAMP(CREATED_TIME) as int))) = '2012-07-09'
DISTRIBUTE BY BUYER_ID
SORT BY BUYER_ID, CREATED_TIME desc
) T1
WHERE rank(BUYER_ID) < 5)
And TestingTable2
with this query-
(SELECT USER_ID, PROD_AND_TS.PRODUCT_ID as PRODUCT_ID, PROD_AND_TS.TIMESTAMPS as TIMESTAMPS FROM testingtable2 lateral view explode(PURCHASED_ITEM) exploded_table as PROD_AND_TS where to_date(from_unixtime(cast(PROD_AND_TS.TIMESTAMPS as BIGINT))) = '2012-07-09')