I have the following tables the first (Range
) includes range of values and additional columns:
row | From | To | Country ....
-----|--------|---------|---------
1 | 1200 | 1500 |
2 | 2200 | 2700 |
3 | 1700 | 1900 |
4 | 2100 | 2150 |
...
The From
and To
are bigint
and are exclusive. The Range
table includes 1.8M records.
Additional table (Values
) contains 2.7M records and looks like:
row | Value | More columns....
--------|--------|----------------
1 | 1777 |
2 | 2122 |
3 | 1832 |
4 | 1340 |
...
I would like to create one table as followed:
row | Value | From | To | More columns....
--------|--------|--------|-------|---
1 | 1777 | 1700 | 1900 |
2 | 2122 | 2100 | 2150 |
3 | 1832 | 1700 | 1900 |
4 | 1340 | 1200 | 1500 |
...
I used BETWEEN
for the above task, but the query never ends:
VALUES.VALUE between RANGE.FROM and RANGE.TO
Is there a change I need to do in table partitions or in Impala?
The main idea of the following solution is to replace a theta join (non-equi join) with an equi join that will lead to a good distribution + efficient local join algorithm.
The range (-infinity,infinity) is being split to section of
n
length.Each range from the ranges table is being associated with the sections it intersects.
e.g. given n=1000, the range
[1652,3701]
will be associated with the sections[1000,2000)
,[2000,3000)
and[3000,4000)
(and will have 3 records, 1 for each section)In the same manner a value from the values table is being associated to the range that contains it, e.g.
2093
will be associated with the range[2000,3000)
.The join between the 2 tables is going to be on the value that represents the section, e.g.
[1652,3701]
and2093
are going to be joined on the section[2000,3000)