Rewrite SQL query to accept date difference condit

2019-08-05 05:53发布

Below is the data in TestingTable1

BUYER_ID   |   ITEM_ID         |    CREATED_TIME
-----------+-------------------+------------------------
1345653        110909316904         2012-07-09 21:29:06
1345653        151851771618         2012-07-09 19:57:33
1345653        221065796761         2012-07-09 19:31:48
1345653        400307563710         2012-07-09 18:57:33
1345653        310411560125         2012-07-09 16:09:49
1345653        120945302103         2012-07-09 13:40:23
1345653        261060982989         2012-07-09 09:02:21

Below is the data in TestingTable2

USER_ID   |   PRODUCT_ID           |    LAST_TIME
-----------+-------------------+-------------------
1345653       110909316904         2012-07-09 21:30:06
1345653       152851771618         2012-07-09 19:57:33
1345653       221065796761         2012-07-09 19:31:48
1345653       400307563710         2012-07-09 18:57:33

I need to Compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2 after comparing from TestingTable1. I created SQL fiddle for this-

http://sqlfiddle.com/#!3/d87b2/1

If you run my query in the SQL Fiddle, you will get output as-

BUYER_ID    ERROR
1345653       5

which is right as last three rows from TestingTable1 is missing in TestingTable2 and rest two are mismatch after comparison from TestingTable1 on BUYER_ID and USER_ID.

Now the complicated thing is starting.

Problem Statement-

In my current output, I am getting ERROR count as 5. So if you see first row in both the tables ITEM_ID and PRODUCT_ID are same but CREATED_TIME and LAST_TIME is not same, and difference between those two times is of only 1 minute. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4 because difference is within 15 minutes range for the first row.

So after taking help from Stack Overflow, I found the solution for this, and below is the sql query that works fine in SQL server(which will give error count as 4) but not in Hive as Hive supports only equality JOINS and I cannot run the below query in Hive. So I need some other way of doing this problem. Is it possible to do the date difference condition in where clause somehow? Basically how I can rewrite the below SQL query in some other way such that it would fulfill my all the requirements above.

SELECT  TT.BUYER_ID,
        COUNT(*)
FROM    (
          SELECT    testingtable1.buyer_id,
                    testingtable1.item_id,
                    testingtable1.created_time
          FROM      testingtable2
                    RIGHT JOIN testingtable1
                        ON (
                             testingtable1.item_id = testingtable2.product_id
                             AND testingtable1.BUYER_ID = testingtable2.USER_ID
                             AND ABS(DATEDIFF(mi, testingtable1.created_time, testingtable2.last_time)) <= 15
                           )
          WHERE     testingtable2.product_id IS NULL
        ) TT
GROUP BY TT.BUYER_ID;

Expected Output that I need after implementing the above feature-

BUYER_ID    ERROR
1345653       4

UPDATE:-

AS Per Below WEST comment, the output only show ERROR count as 1, but it should be showing as 4. And also after removing the last row he added in his SQL fiddle, its not working and I am getting zero error, which is not right as there is already one error in the time difference.

1条回答
孤傲高冷的网名
2楼-- · 2019-08-05 06:15

What if you do an equijoin, and put your time comparison logic inside of a CASE expression with a SUM, instead of a COUNT?

SELECT  TT1.BUYER_ID,
        SUM(CASE WHEN ABS(DATEDIFF(mi, TT1.created_time, TT2.last_time)) <= 15 THEN 0
                 ELSE 1
            END) AS ERROR
FROM    testingtable1 TT1
        LEFT JOIN testingtable2 TT2
            ON (
                 TT1.item_id = TT2.product_id
                 AND TT1.BUYER_ID = TT2.USER_ID
               )
GROUP BY TT1.BUYER_ID;

You will need to convert the date arithmetic to whatever hive uses...

Here's a MS Sql server SQLFiddle which gets 4 errors returned.

查看更多
登录 后发表回答