Below is the data in TestingTable1
BUYER_ID | ITEM_ID | CREATED_TIME
-----------+-------------------+------------------------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
1345653 310411560125 2012-07-09 16:09:49
1345653 120945302103 2012-07-09 13:40:23
1345653 261060982989 2012-07-09 09:02:21
Below is the data in TestingTable2
USER_ID | PRODUCT_ID | LAST_TIME
-----------+-------------------+-------------------
1345653 110909316904 2012-07-09 21:30:06
1345653 152851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
I need to Compare TestingTable2
with TestingTable1
on BUYER_ID
and USER_ID
. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2
after comparing from TestingTable1
. I created SQL fiddle for this-
http://sqlfiddle.com/#!3/d87b2/1
If you run my query in the SQL Fiddle, you will get output as-
BUYER_ID ERROR
1345653 5
which is right as last three
rows from TestingTable1
is missing in TestingTable2
and rest two
are mismatch after comparison from TestingTable1
on BUYER_ID
and USER_ID
.
Now the complicated thing is starting.
Problem Statement-
In my current output, I am getting ERROR count as 5
. So if you see first row in both the tables ITEM_ID
and PRODUCT_ID
are same but CREATED_TIME
and LAST_TIME
is not same, and difference between those two times is of only 1 minute
. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range
, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4
because difference is within 15 minutes range
for the first row.
So after taking help from Stack Overflow, I found the solution for this, and below is the sql query that works fine in SQL server(which will give error count as 4) but not in Hive as Hive supports only equality JOINS
and I cannot run the below query in Hive. So I need some other way of doing this problem. Is it possible to do the date difference condition in where clause somehow? Basically how I can rewrite the below SQL query
in some other way such that it would fulfill my all the requirements above.
SELECT TT.BUYER_ID,
COUNT(*)
FROM (
SELECT testingtable1.buyer_id,
testingtable1.item_id,
testingtable1.created_time
FROM testingtable2
RIGHT JOIN testingtable1
ON (
testingtable1.item_id = testingtable2.product_id
AND testingtable1.BUYER_ID = testingtable2.USER_ID
AND ABS(DATEDIFF(mi, testingtable1.created_time, testingtable2.last_time)) <= 15
)
WHERE testingtable2.product_id IS NULL
) TT
GROUP BY TT.BUYER_ID;
Expected Output that I need after implementing the above feature-
BUYER_ID ERROR
1345653 4
UPDATE:-
AS Per Below WEST comment, the output only show ERROR count as 1, but it should be showing as 4. And also after removing the last row he added in his SQL fiddle, its not working and I am getting zero error, which is not right as there is already one error in the time difference.
What if you do an equijoin, and put your time comparison logic inside of a CASE expression with a SUM, instead of a COUNT?
You will need to convert the date arithmetic to whatever hive uses...
Here's a MS Sql server SQLFiddle which gets 4 errors returned.