I have two sets of data both stored in an S3 bucket which I need to process in Hive and store the output back to S3. Sample rows from each datasets are as follows:
DataSet 1: {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126"}
DataSet2: {"requestId":"TADS6152JHGJH5435","userAgent":"Mozilla"}
I need to join these two data sets based on the requestId
and output a combined row as:
Output: {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126","userAgent":"Mozilla"}
The requestIds in dataset 1 is a proper subset of of the requestids in dataset 2. I am using a LEFT OUTER JOIN
to get my output. Here is a simplified version of my Hive script:
CREATE EXTERNAL TABLE dataset1 (
requestId string,
customerId string,
sessionId string
)
LOCATION 's3://path_to_dataset1/';
CREATE EXTERNAL TABLE dataset2 (
requestId string,
userAgent string
)
LOCATION 's3://path_to_dataset2/';
CREATE EXTERNAL TABLE output (
requestId string,
customerId string,
sessionId string,
userAgent string
)
LOCATION 's3://path_to_output/';
INSERT OVERWRITE TABLE output
SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
FROM dataset1 d1 LEFT OUTER JOIN dataset2 d2
ON (d1.requestId=d2.requestId);
My question is:
Are there opportunities to optimize this join? Can I use partitioning/bucketing of the tables to run the join faster? I have set hive.auto.convert.join
to true
in my script. What other hive properties should I set to gain better performance for the above queries?
We can improve the performance of joins by enabling Auto Convert Map Joins and enabling optimization of skew joins.
Auto Map-Join is a very useful feature when joining a big table with a small table. if we enable this feature, the small table will be saved in the local cache on each node, and then joined with the big table in the Map phase. Enabling Auto Map Join provides two advantages. First, loading a small table into cache will save read time on each data node. Second, it avoids skew joins in the Hive query, since the join operation has been already done in the Map phase for each block of data.
We can enable optimization of skew joins, i.e. imbalanced joins by setting hive.optimize.skewjoin property to true either via SET command in hive shell or hive-site.xml file.
If tables are bucketed by a particular column and these tables are being used in joins then we can enable bucketed map join to improve the performance.
.
Instead of running Hive queries on venerable Map-reduce engine, we can improve the performance of hive queries at least by 100% to 300 % by running on Tez execution engine. We can enable the Tez engine with below property from hive shell.
.
Hive converts a query into one or more stages. Stages could be a MapReduce stage, sampling stage, a merge stage, a limit stage. By default, Hive executes these stages one at a time. A particular job may consist of some stages that are not dependent on each other and could be executed in
parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by setting below properties.
.
Vectorization feature is introduced into hive for the first time in hive-0.13.1 release only. By vectorized query execution, we can improve performance of operations like scans, aggregations, filters and joins, by performing them in batches of 1024 rows at once instead of single row each time.
We can enable vectorized query execution by setting below three properties in either hive shell or hive-site.xml file.
.
Recent Hive releases provided the feature of cost based optimization, one can achieve further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, degree of parallelism and others.
cost based optimization can be enabled by setting below properties in hive-site.xml file.