How to optimize scan of 1 huge file / table in Hiv

2019-02-28 09:41发布

问题:

I am currently trying to associate each lat long ping from a device to its ZIP code.

I have de-normalized lat long device ping data and created a cross-product/ Cartesian product join table in which each row has the ST_Point(long,lat), geometry_shape_of_ZIP and associated zip code for that geometry. for testing purpose I have around 45 million rows in the table and it'll increase in production to about 1 billion every day.

Even though the data is flattened and no join conditions, the query takes about 2 hours to complete. Is there any faster way to compute spatial queries? Or how can I optimize the following query.

Inline is some of the optimizations steps I have already performed. Using the optimizations all the other operations gets done in max 5 minutes except for this one step. I am using aws cluster 2 mater nodes and 5 data nodes.

set hive.vectorized.execution.enabled = true;

set hive.execution.engine=tez;

set hive.enforce.sorting=true;

set hive.cbo.enable=true;

set hive.compute.query.using.stats=true;

set hive.stats.fetch.column.stats=true;

set hive.stats.fetch.partition.stats=true;

analyze table tele_us_zipmatch compute statistics for columns;

CREATE TABLE zipcheck (

`long4` double,

`lat4` double,

state_name string,

country_code string,

country_name string, region string,

zip int,

countyname string) PARTITIONED by (state_id string)

STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY",

'orc.create.index'='true',

'orc.bloom.filter.columns'='');

INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)

select long4, lat4, state_name, country_code, country_name, region, zip, countyname, state_id from tele_us_zipmatch

where ST_Contains(wkt_shape,zip_point)=TRUE;

ST_Contains is the function from esri (ref: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests ).

Any help is greatly appreciated.

Thanks.

回答1:

If the ZIP-code dataset can fit into memory, try a custom Map-Reduce application that uses a just-in-time in-memory quadtree index on the ZIP-code data, by adapting the sample in the GIS-Tools-for-Hadoop.

[collaborator]