I have two tables A
and B
, where B
is huge (20 million by 300) and A
is of moderate size (300k by 10). A
contains one column that is address and B
contains 3 columns that can be put together to form a proper street address. For example, in A
, the address column could be:
id | Address
-----------
233 | 123 Main St
and in B
we could have:
Number | Street_name | Street_suffix | Tax
------------------------------------------------
123 | Main | Street | 320.2
I want to join them using string matching similar to LIKE
with something like the following:
select A.id, B.Tax
from A
left join B
on A.Address **contains** B.Number
and A.Address **contains** B.Street_name;
Basically I am trying to match the records by saying that if A
's address contains B
's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).
The 2 contains part is something I am not sure how to implement. Any ideas?
I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0).
Hive join have limitation you can use join with equality conditions only.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
If you could logically explode A.Address column in such away that it matches table B column format then you can use join on that column ....
You can use join with equality conditions only, but -
You can cross join and filter.
Demo
First of all, JOINs in hive only work with equality conditions
Refer here
So you can try the where condition with like and concat operation like below