Join table by string matching in Hive or Impala or

2019-07-23 16:24发布

I have two tables A and B, where B is huge (20 million by 300) and A is of moderate size (300k by 10). A contains one column that is address and B contains 3 columns that can be put together to form a proper street address. For example, in A, the address column could be:

id  | Address
-----------
233 | 123 Main St

and in B we could have:

Number  |  Street_name  | Street_suffix | Tax
------------------------------------------------
123     |  Main         | Street        | 320.2

I want to join them using string matching similar to LIKE with something like the following:

select A.id, B.Tax
from A
  left join B
    on A.Address **contains** B.Number
    and A.Address **contains** B.Street_name;

Basically I am trying to match the records by saying that if A's address contains B's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).

The 2 contains part is something I am not sure how to implement. Any ideas?

I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0).

3条回答
beautiful°
2楼-- · 2019-07-23 16:37

Hive join have limitation you can use join with equality conditions only.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

If you could logically explode A.Address column in such away that it matches table B column format then you can use join on that column ....

查看更多
Evening l夕情丶
3楼-- · 2019-07-23 16:56

You can use join with equality conditions only, but -
You can cross join and filter.

select      A.id, B.Tax
from        A cross join B
where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
        and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
;

Demo

hive> create table A (id int,Address string);
OK
hive> create table B (number int,Street_name string,Street_suffix string,tax decimal(12,2));
OK
hive> insert into A values (233,'123 Main St');
Query ID = ...
OK
hive> insert into B values (123,'Main','Street',320.2);
Query ID = ...
OK
hive> select      A.id, B.Tax
    > from        A cross join B
    > where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
    >         and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
    > ;
Warning: Map Join MAPJOIN[8][bigTable=b] in task 'Stage-3:MAPRED' is a cross product
Query ID = ...
OK
233 320.2
hive> 
查看更多
我想做一个坏孩纸
4楼-- · 2019-07-23 16:57

First of all, JOINs in hive only work with equality conditions

Refer here

So you can try the where condition with like and concat operation like below

select A.id, B.Tax
from A,B where A.Address LIKE CONCAT('%',cast(B.Number as string),'%') 
               AND
               A.Address LIKE CONCAT('%',B.Street_name,'%')
查看更多
登录 后发表回答