Count the mismatch and missing

2019-07-27 08:21发布

问题:

Below is the data in TestingTable1

BUYER_ID   |   ITEM_ID         |    CREATED_TIME
-----------+-------------------+------------------------
1345653        110909316904         2012-07-09 21:29:06
1345653        151851771618         2012-07-09 19:57:33
1345653        221065796761         2012-07-09 19:31:48
1345653        400307563710         2012-07-09 18:57:33
1345653        310411560125         2012-07-09 16:09:49
1345653        120945302103         2012-07-09 13:40:23
1345653        261060982989         2012-07-09 09:02:21

Below is the data in TestingTable2

USER_ID   |   PRODUCT_ID           |    LAST_TIME
-----------+-------------------+-------------------
1345653       110909316904         2012-07-09 21:30:06
1345653       152851771618         2012-07-09 19:57:33
1345653       221065796761         2012-07-09 19:31:48
1345653       400307563710         2012-07-09 18:57:33

I need to Compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2 after comparing from TestingTable1. I created SQL fiddle for this-

http://sqlfiddle.com/#!3/d87b2/1

If you run my query in the SQL Fiddle, you will get output as-

BUYER_ID    ERROR
1345653       5

which is right as last three rows from TestingTable1 is missing in TestingTable2 and rest two are mismatch after comparison from TestingTable1 on BUYER_ID and USER_ID.

Now the complicated thing is starting.

Problem Statement-

In my current output, I am getting ERROR count as 5. So if you see first row in both the tables ITEM_ID and PRODUCT_ID are same but CREATED_TIME and LAST_TIME is not same, and difference between those two times is of only 1 minute. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4 because difference is within 15 minute range for the first row.

So how can I implement this feature in my current query? That's my question.

P.S- I am working with Hive and Hive supports sql like syntax. So I think any modification will work in my current query.

回答1:

your SQL Server SQL Fiddle query could be modified as follows and made to work. However, i am not sure if Hive supports datediff

SELECT TT.BUYER_ID , COUNT(*) FROM
(SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time FROM
 testingtable2 RIGHT JOIN testingtable1 
 ON (testingtable1.item_id = testingtable2.product_id 
 AND testingtable1.BUYER_ID = testingtable2.USER_ID 
 AND abs(datediff(mi, testingtable1.created_time,testingtable2.last_time)) <= 15)
 where testingtable2.product_id IS NULL) TT GROUP BY TT.BUYER_ID;