Tag huge list of elements with lat/long with large

2019-05-26 23:23发布

问题:

I have a huge list of geolocation events:

Event (1 billion)
------
id
datetime
lat
long

And a list of point of interest loaded from open street map:

POI (1 million)
------
id
tag   (shop, restaurant, etc.)
lat
long

I would like to assign to each to each event the tag of the point of interest. What is the best architecture to achieve this problem? We tried using Google BigQuery but we have to do a cross join and it does not work. We are open to use any other big data system.

回答1:

Using Dataflow you can do a cross join pretty easily using CoGroupByKey. Using this approach only the Event and POI you are joining would need to fit in memory (Dataflow will automatically spill to disk if the list of items for a given key is too large to fit in memory).

Here's some more detail.

  • Create a PCollection of events keyed by latitude and longitude.
  • Create a PCollection of POI keyed by latitude and longitude
  • Use a CoGroupByKey to join the two PCollections.
  • Write a DoFn that processes the CoGbkResult
  • The DoFn would look something like:

    
    PCollection<T> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
      new DoFn<KV<K, CoGbkResult>, T>() {
        @Override
        public void processElement(ProcessContext c) {
          KV<K, CoGbkResult> e = c.element();
          // Get all collection 1 values
          Iterable<Event> eventVals = e.getValue().getAll(eventTag);
          // Now get collection 2 values
          Iterable<Poi> poiVals = e.getValue().getAll(poiTag);
          for (Event e : eventVals) {
            for (Poi p : poiVal) {
              ...
              c.output(...tagged event...);
            }
          }
        }
      }));
    

As discussed in this Answer you could also use a side input to pass a map whose keys were latitude and longitude and the values were the details of a POI. That approach will work if the data can fit in memory. If you only have 1 million POI and you are only storing the fields listed it will probably fit in memory.

Note: I'm on the Dataflow team.



回答2:

Can you open up the dataset containing the open street map table? (it's open data after all). With that table I can try to optimize the query.

For a similar question, I optimized it via generating keys for each row that can be used to avoid having to CROSS JOIN over the whole dataset.

http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

StackO: How to improve performance of GeoIP query in BigQuery?