I have two tables A
and B
, where B
is huge (20 million by 300) and A
is of moderate size (300k by 10). A
contains one column that is address and B
contains 3 columns that can be put together to form a proper street address. For example, in A
, the address column could be:
id | Address
-----------
233 | 123 Main St
and in B
we could have:
Number | Street_name | Street_suffix | Tax
------------------------------------------------
123 | Main | Street | 320.2
I want to join them using string matching similar to LIKE
with something like the following:
select A.id, B.Tax
from A
left join B
on A.Address **contains** B.Number
and A.Address **contains** B.Street_name;
Basically I am trying to match the records by saying that if A
's address contains B
's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).
The 2 contains part is something I am not sure how to implement. Any ideas?
I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0).
You can use join with equality conditions only, but -
You can cross join and filter.
select A.id, B.Tax
from A cross join B
where concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
;
Demo
hive> create table A (id int,Address string);
OK
hive> create table B (number int,Street_name string,Street_suffix string,tax decimal(12,2));
OK
hive> insert into A values (233,'123 Main St');
Query ID = ...
OK
hive> insert into B values (123,'Main','Street',320.2);
Query ID = ...
OK
hive> select A.id, B.Tax
> from A cross join B
> where concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
> and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
> ;
Warning: Map Join MAPJOIN[8][bigTable=b] in task 'Stage-3:MAPRED' is a cross product
Query ID = ...
OK
233 320.2
hive>
Hive join have limitation you can use join with equality conditions only.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
If you could logically explode A.Address column in such away that it matches table B column format then you can use join on that column ....
First of all, JOINs in hive only work with equality conditions
Refer here
So you can try the where condition with like and concat operation like below
select A.id, B.Tax
from A,B where A.Address LIKE CONCAT('%',cast(B.Number as string),'%')
AND
A.Address LIKE CONCAT('%',B.Street_name,'%')