Hive: work around for non equi left join

2020-02-11 06:39发布

问题:

Hive does not support non equi joins: The common work around is to move the join condition to the where clause, which work fine when you want an inner join. but what about a left join?

Contrived example. Let say we have an orderLineItem table, and we need to join to a ProductPrice table that has a productID, price & a date range for which the price applies. We want to join to this where ProductID=ProductID & OrderDate between start and End date. If a productID or a valid date range do not match, I'd still want to see all orderLineItems.

This SQL fiddle is an example of how we'd do this in MSSQL: http://sqlfiddle.com/#!6/fb877/7

Problem If I apply the typical workaround, and move the non equi filter to the where clause, it becomes an inner join. In the case above, in the sql fiddle & below, I have a product ID that is not in the lookup.

Question: Provided hive does not support non eqi-joins, How can a left non-eqi be achieved ?

[SQLFiddle Content]

Tables:

CREATE TABLE OrderLineItem(
  LineItemIDId int IDENTITY(1,1),
  OrderID int  NOT NULL,
  ProductID int NOT NULL,
  OrderDate Date
);


CREATE TABLE ProductPrice(
  ProductID int,
  Cost float,
  startDate  Date,
  EndDate  Date


);

loading The data & how we'd join in MSSQL:

--Old Price. Should be ignored
INSERT INTO ProductPrice(ProductID, COST,startDate,EndDate) VALUES  (1, 50,'12/1/2012','1/1/2013');
INSERT INTO ProductPrice(ProductID, COST,startDate,EndDate) VALUES (2, 55,'12/1/2012','1/1/2013');

--Price for Order 2. Should be applied to Order 1
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(1, 20,'12/1/2013','1/1/2014');
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(2, 25,'12/1/2013','1/1/2014');

--Price for Order 2. Should be applied to Order 2
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(1, 15,'1/2/2014','3/1/2014');
INSERT INTO ProductPrice (ProductID, COST,startDate,EndDate) VALUES(2, 20,'1/2/2014','3/1/2014');


--January 1st 2014 Order
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (1, 1,'1/1/2014') ;
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (1, 2,'1/1/2014');

--Feb 1st 2014 Order
INSERT INTO OrderLineItem(OrderID,ProductID,OrderDate) VALUES (2, 1,'2/1/2014');
INSERT INTO OrderLineItem (OrderID,ProductID,OrderDate) VALUES(2, 2,'2/1/2014');
INSERT INTO OrderLineItem (OrderID,ProductID,OrderDate) VALUES(2, 3,'2/1/2014'); -- no price

SELECT * FROM OrderLineItem;

SELECT * FROM OrderLineItem li LEFT OUTER JOIN  ProductPrice p on
p.ProductID=li.ProductID AND  OrderDate BETWEEN  startDate AND  EndDate;

回答1:

  1. Create a copy of the left table with added serial row numbers:

    CREATE TABLE OrderLineItem_serial AS
    SELECT ROW_NUMBER() OVER() AS serial, * FROM OrderLineItem;
    

    Remark: This may work better for some tables formats (must be WITHOUT COMPRESSION):

    CONCAT(INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE) AS serial
    
  2. Do an inner join:

    CREATE TABLE OrderLineItem_inner AS
    SELECT * FROM OrderLineItem_serial li JOIN ProductPrice p
    on p.ProductID = li.ProductID WHERE OrderDate BETWEEN startDate AND EndDate;
    
  3. Left join by serial:

    SELECT * FROM OrderLineItem_serial li
    LEFT OUTER JOIN OrderLineItem_inner i on li.serial = i.serial;
    


回答2:

Why not use a WHERE clause that allows for NULL cases separately?

SELECT * FROM OrderLineItem li 
LEFT OUTER JOIN  ProductPrice p 
ON p.ProductID=li.ProductID 
WHERE ( StartDate IS NULL OR OrderDate BETWEEN startDate AND EndDate);

That should take care of it - if the left join matches it'll use the date logic, if it doesn't it'll keep the NULL values intact as a left join should.



回答3:

Not sure if you can avoid using a double join:

SELECT * 
FROM OrderLineItem li 
LEFT OUTER JOIN  (
  SELECT p.*
  FROM ProductPrice p
  JOIN OrderLineItem li 
  ON p.ProductID=li.ProductID 
  WHERE OrderDate BETWEEN StartDate AND EndDate ) p
ON p.ProductId = li.ProductID
WHERE StartDate IS NULL OR 
  OrderDate BETWEEN StartDate AND EndDate;

This way if there is a match and StartDate is not null, there has to be a valid start/end date match.



回答4:

Hive 0.10 supports cross joins, so you could handle all your "theta join" (non-equijoin) conditions in the WHERE clause.