Join table by string matching in Hive or Impala or

I have two tables A and B, where B is huge (20 million by 300) and A is of moderate size (300k by 10). A contains one column that is address and B contains 3 columns that can be put together to form a proper street address. For example, in A, the address column could be:

id  | Address
-----------
233 | 123 Main St

and in B we could have:

Number  |  Street_name  | Street_suffix | Tax
------------------------------------------------
123     |  Main         | Street        | 320.2

I want to join them using string matching similar to LIKE with something like the following:

select A.id, B.Tax
from A
  left join B
    on A.Address **contains** B.Number
    and A.Address **contains** B.Street_name;

Basically I am trying to match the records by saying that if A's address contains B's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).

The 2 contains part is something I am not sure how to implement. Any ideas?

I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0).

标签： sql hadoop hive apache-pig impala

3条回答

beautiful°

2楼-- · 2019-07-23 16:37

Hive join have limitation you can use join with equality conditions only.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

If you could logically explode A.Address column in such away that it matches table B column format then you can use join on that column ....

0人赞添加讨论(0) 举报

Evening l夕情丶

3楼-- · 2019-07-23 16:56

You can use join with equality conditions only, but -
You can cross join and filter.

select      A.id, B.Tax
from        A cross join B
where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
        and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
;

Demo

hive> create table A (id int,Address string);
OK
hive> create table B (number int,Street_name string,Street_suffix string,tax decimal(12,2));
OK
hive> insert into A values (233,'123 Main St');
Query ID = ...
OK
hive> insert into B values (123,'Main','Street',320.2);
Query ID = ...
OK
hive> select      A.id, B.Tax
    > from        A cross join B
    > where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
    >         and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
    > ;
Warning: Map Join MAPJOIN[8][bigTable=b] in task 'Stage-3:MAPRED' is a cross product
Query ID = ...
OK
233 320.2
hive>

0人赞添加讨论(0) 举报

我想做一个坏孩纸

4楼-- · 2019-07-23 16:57

First of all, JOINs in hive only work with equality conditions

Refer here

So you can try the where condition with like and concat operation like below

select A.id, B.Tax
from A,B where A.Address LIKE CONCAT('%',cast(B.Number as string),'%') 
               AND
               A.Address LIKE CONCAT('%',B.Street_name,'%')

0人赞添加讨论(0) 举报

Join table by string matching in Hive or Impala or

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间