Join Tables on Date Range in Hive

Posted 2019-06-22 08:10

Question:

I need to join tableA to tableB on employee_id, and cal_date from tableA needs to fall between date_start and date_end from tableB. I ran the query below and received the error message below. Could you please help me correct the query? Thank you for your help!

Both left and right aliases encountered in JOIN 'date_start'.

select a.*, b.skill_group 
from tableA a 
  left join tableB b 
    on a.employee_id= b.employee_id 
    and a.cal_date >= b.date_start 
    and a.cal_date <= b.date_end

Answer 1:

RTFM. Quoting the Hive LanguageManual page on Joins:

Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.

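You can try moving the date-range filter into a WHERE clause. The result is a lousy, partially Cartesian join (every tableA row is paired with every tableB row for the same employee_id) followed by a post-processing cleanup. Yuck. Depending on the actual cardinality of your "skill group" table, it may run fast, or take whole days. Note that the WHERE filter also discards tableA rows that have no matching range, so the LEFT JOIN effectively turns into an inner join.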
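
A minimal sketch of that rewrite, using the table and column names from the question. It is aimed at Hive versions that reject non-equality conditions in ON; the LanguageManual reportedly lists complex ON expressions as supported from Hive 2.2.0 onward, so on a newer cluster the original query may run unchanged.

```sql
-- Only the equality condition stays in ON.
-- The date-range test moves to WHERE and is applied after the join.
-- Written as a plain JOIN: tableA rows without a matching range are not returned.
SELECT a.*, b.skill_group
FROM tableA a
  JOIN tableB b
    ON a.employee_id = b.employee_id
WHERE a.cal_date >= b.date_start
  AND a.cal_date <= b.date_end;
```

The intermediate result holds one row per (tableA row, tableB row) pair sharing an employee_id, which is where the cost comes from when an employee has many date ranges.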



Answer 2:

If your situation allows, do it in two queries.

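First, run an inner join with the range condition in the WHERE clause; this collects the matched rows. Then run an outer join from tableA back to that result, matching on all the key columns, with a WHERE clause that keeps only the rows where one of the joined fields is null; this appends the rows that found no match.

Example: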

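-- Step 1: matched rows. Equality goes in ON, the date range goes in WHERE.
create table tableC as
select a.*, b.skill_group
from tableA a
  join tableB b
    on a.employee_id = b.employee_id
where a.cal_date >= b.date_start
  and a.cal_date <= b.date_end;

-- Step 2: append the tableA rows that found no match, with a null skill_group.
-- The cast assumes skill_group is a string column.
with c as (select * from tableC)
insert into table tableC
select a.*, cast(null as string) as skill_group
from tableA a
  left join c
    on a.employee_id = c.employee_id
    and a.cal_date = c.cal_date
where c.employee_id is null;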
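
If a staging table is not wanted, the same two steps can probably be folded into a single statement: compute the matched rows in a subquery, then left join tableA back to it on equality only. This is an untested sketch, not something taken from the answers above. The aliases a2 and m are made up for illustration, and the DISTINCT is an assumption added so that duplicate (employee_id, cal_date) rows in tableA are not multiplied by the join back.

```sql
SELECT a.*, m.skill_group
FROM tableA a
  LEFT JOIN (
    -- Matched rows only: equality in ON, date range in WHERE.
    SELECT DISTINCT a2.employee_id, a2.cal_date, b.skill_group
    FROM tableA a2
      JOIN tableB b
        ON a2.employee_id = b.employee_id
    WHERE a2.cal_date >= b.date_start
      AND a2.cal_date <= b.date_end
  ) m
    -- Equality-only condition, so older Hive versions accept it in ON.
    ON a.employee_id = m.employee_id
    AND a.cal_date = m.cal_date;
```

Rows of tableA with no matching range come back with a null skill_group, as in the original LEFT JOIN. One difference to be aware of: if two overlapping ranges in tableB carry the same skill_group, the DISTINCT collapses them into one output row, whereas the original query would have returned two.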