How can I join two tables using intervals in Googl

2019-08-02 19:18发布

问题:

You have identified a solution of finding an area within a bounding box /circle using cross join as below:

SELECT A.ID, C.Car 
FROM Cars C 
CROSS JOIN Areas A
WHERE C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
  C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax

at: How to cross join in Big Query using intervals?

however, using cross join for large data sets is blocked by GBQ ops team due to constrains on the infrastructure.
Hence, my question: how could I find set of lat,longs within large data table (table A) that are within another set of bounding boxes , small(table B) ?

My query as below has been blocked:

select a.a1, a.a2 , a.mdl, b.name, count(1) count 
from TableMaster a 
CROSS JOIN places_locations b 
where (a.lat 
    BETWEEN  b.bottom_right_lat AND b.top_left_lat) 
AND (a.long 
    BETWEEN b.top_left_long AND b.bottom_right_long) 
group by ....

TableMaster is 538 GB with 6,658,716,712 rows (cleaned/absolute minimum) places_locations varies per query around 5 to 100kb.

I have tried to adapt fake join based on a template: How to improve performance of GeoIP query in BigQuery?

however, query takes an hour and does not produce any results nor any errors are displayed.

Could you identify a possible path to solve this puzzle at all?

回答1:

The problem you're seeing is that the cross join generates too many intermediate values (6 billion x 1k = 6 trillion).

The way to work around this is to generate fewer outputs. If you have additional filters you can apply, you should try applying them before you do the join. If you could do the group by (or part of it) before the join, that would also help.

Moreover, for doing the lookup, you could do a more coarse-grained lookup first. That is, if you could do an initial cross join with a smaller table that has course grained regions, then you could join against the larger table on region id rather than doing a cross join.



回答2:

okey so fake join does work at the end, solution:

` select a.B, a.C , count(1) count from ( SELECT B,  C, A, lat, long from [GB_Data.PlacesMasterA] WHERE not B

is null) a JOIN (SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A from [Places.placeABOXA] ) b on a.A=b.A where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat) AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long) group each by B, C `