Poor performance on hash joins with Pig on Tez

Posted 2019-07-24 23:37

I have a series of Pig scripts that transform hundreds of millions of records from multiple data sources, which then need to be joined together. Towards the end of each script, I reach a point where JOIN performance becomes terribly slow. Looking at the DAG in the Tez View, I see that the join is split into relatively few tasks (typically 100-200), but each task takes multiple hours to complete. The task description shows that it's doing a HASH_JOIN.

Interestingly, I only run into this bottleneck when running on the Tez execution engine. On MapReduce, the join can still take a while, but nothing like the agonizing crawl I get on Tez. However, running on MapReduce is a problem because of a separate issue, for which I've asked another question here.

Here's a sample of my code (apologies, I've had to make the code very generic to be able to post on the interwebs). I'm wondering what I can do to remove this bottleneck -- would specifying parallelism help? Is there something wrong with my approach?

-- Incoming data:
-- A: hundreds of millions of rows, 19 fields
-- B: hundreds of millions of rows, 3 fields
-- C: hundreds of millions of rows, 5 fields
-- D: a few thousand rows, 5 fields

J = -- This reduces the size of A, but still probably in the hundreds of millions
    FILTER A
    BY qualifying == 1;

K = -- This is a one-to-one join that doesn't explode the number of rows in J
    JOIN J BY Id
       , B BY Id;

L =
    FOREACH K
    GENERATE J1 AS L1
           , J2 AS L2
           , J3 AS L3
           , J4 AS L4
           , J5 AS L5
           , J6 AS L6
           , J7 AS L7
           , J8 AS L8
           , B1 AS L9
           , B2 AS L10
           ;

M = -- Reduces the size of C to around one hundred million rows
    FILTER C
    BY Code matches 'Code-.+';

M_WithYear =
    FOREACH M
    GENERATE *
           , (int)REGEX_EXTRACT(Code, 'Code-.+-([0-9]+)', 1) AS year:int
           ;

SPLIT M_WithYear
    INTO M_annual IF year <= (int)'$currentYear' -- roughly 75% of the data from M
       , M_lifetime IF Code == 'Code-Lifetime'; -- roughly 25% of the data from M

-- Transformations for M_annual

N =
    JOIN M_annual BY Id, D BY Id USING 'replicated';

O = -- This is where performance falls apart
    JOIN N BY (Id, year, M7) -- M7 matches L7
       , L BY (Id, year, L7);

P =
    FOREACH O
    GENERATE N1 AS P1
           , N2 AS P2
           , N3 AS P3
           , N4 AS P4
           , N5 AS P5
           , N6 AS P6
           , N7 AS P7
           , N8 AS P8
           , N9 AS P9
           , L1 AS P10
           , L2 AS P11
           ;

-- Transformations N-P above repeated for M_lifetime
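
For concreteness, here is a minimal sketch of what "specifying parallelism" on the slow join could look like. It uses Pig's standard `default_parallel` setting, the `PARALLEL` clause, and the `'skewed'` join type. The value 400 and the alias `O_skewed` are hypothetical placeholders, not tuned or tested values, and whether either variant actually helps on Tez depends on how the join keys are distributed.

```pig
-- Assumption: 400 is an illustrative task count, not a recommendation.
SET default_parallel 400;  -- script-wide default for reduce-side operators

O = -- Same join as above, with an explicit task count for this operator only
    JOIN N BY (Id, year, M7)
       , L BY (Id, year, L7)
    PARALLEL 400;

O_skewed = -- Hypothetical alternative if a few key values dominate the data
    JOIN N BY (Id, year, M7)
       , L BY (Id, year, L7)
    USING 'skewed' PARALLEL 400;
```

If a handful of (Id, year, M7) combinations account for most of the rows, raising the task count alone would likely leave a few tasks doing most of the work, which is the case the skewed join is meant to address.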
