I am implementing a left join functionality using map reduce. Left side is having around 600 million records and right side is having around 23 million records. In mapper I am making the keys using the columns used in left join condition and passing the key-value output from mapper to reducer. I am getting performance issue because of few mapper keys for which number of values in both the tables are high (eg. 456789 and 78960 respectively). Even though other reducers finish their job, these reducers keep running for longer time. Is there any way that multiple reducers can work on the same key-value output from mapper in parallel to better the performance?
This is the Hive query that i want to optimize.
select distinct
a.sequence,
a.fr_nbr,
b.to_nbr,
a.fr_radius,
a.fr_zip,
a.latitude as fr_latitude,
a.longitude as fr_longitude,
a.to_zip,
b.latitude as to_latitude,
b.longitude as to_longitude,
((2 * asin( sqrt( cos(radians(a.latitude)) * cos(radians(b.latitude)) * pow(sin(radians((a.longitude - b.longitude)/2)), 2) + pow(sin(radians((a.latitude - b.latitude)/2)), 2) ) )) * 6371 * 0.621371) as distance,
a.load_year,
a.load_month
from common.sb_p1 a LEFT JOIN common.sb__temp0u b
on a.to_zip=b.zip
and a.load_year=b.load_year
and a.load_month=b.load_month
where b.correction = 0
and a.fr_nbr <> b.to_nbr
and ((2 * asin( sqrt( cos(radians(a.latitude)) * cos(radians(b.latitude)) * pow(sin(radians((a.longitude - b.longitude)/2)), 2) + pow(sin(radians((a.latitude - b.latitude)/2)), 2) ) )) * 6371 * 0.621371 <= a.fr_radius)
Any other solution will also be appreciated.
You can also consider using HiveQL for this. Its pretty much meant for situations like the one you have mentioned above and takes care of complexity of map reduce implementation.
Split the skewed keys using
UNION ALL
:These subqueries will run in parallel, skewed keys will not be distributed to single reducer