Performance tuning for Amazon EMR / Hive processing

Posted 2019-07-23 09:19

Question:

I am trying to use Amazon EMR with Hive to process a rather large number of logfiles generated by ad tracking servers. The performance is far worse than I would expect, and I am hoping someone can give me pointers for improvement.

The tracking servers upload log files every few minutes to S3 folders partitioned by day (e.g., "2014-05-20"). There are roughly 3,000 files uploaded per day in total, at roughly 20k per file.

Using Hive, I have successfully created external tables referencing the data in S3 and set up partitions for 30 days' worth of log files. I have verified that the partitioning is working correctly, and simple queries (e.g., "SELECT * FROM click WHERE dt='2014-05-19' LIMIT 10") work correctly and respond quickly.
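
For reference, the external table and its partitions are set up roughly like this (the bucket name, path layout, and row format below are illustrative, and the real table has many more columns):

CREATE EXTERNAL TABLE click (
    clickId string
    -- ... remaining log fields ...
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- illustrative; matches the raw log layout
LOCATION 's3n://my-tracking-bucket/logs/';       -- illustrative bucket/prefix

-- one partition per daily folder, repeated for each of the 30 days
ALTER TABLE click ADD IF NOT EXISTS PARTITION (dt='2014-05-19')
    LOCATION 's3n://my-tracking-bucket/logs/2014-05-19/';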

I am loading the data into temporary HDFS tables for subsequent queries. To do so, I run an HQL job that is essentially this (note that click is the external table in S3):

CREATE TABLE tmp_click (
    clickId string,
    -- ...
    dt string
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE tmp_click
    SELECT
        clickId,
        -- ...
        k.dt
    FROM
        click k
    WHERE
        k.dt >= '${START_DAY}' AND
        k.dt <= '${END_DAY}';

This operation takes upwards of an hour with 25 xlarge instances working as core/task nodes. Given that there is basically no processing going on here -- it's just copying the data over, right? -- I feel like there must be something I'm missing. Can anyone give me any tips on what to investigate?

I've considered that perhaps the large number of files (~3,000 per day) or the gzip compression of the logfiles might be the problem, but I have no ability to control the input.
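
Would settings along these lines, to make Hive combine the many small .gz files into fewer map tasks, be worth trying? (Property names as I understand them; the values are illustrative, and I have not verified how much this helps given that gzip is not splittable.)

-- use the combining input format so several small files can share one map task
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- reportedly required for combining to kick in when the input files are compressed
-- (each .gz is still read whole, since gzip is not splittable)
SET hive.hadoop.supports.splittable.combineinputformat=true;
-- target split sizes in bytes (256 MB / 128 MB here, purely illustrative)
SET mapred.max.split.size=268435456;
SET mapred.min.split.size.per.node=134217728;
SET mapred.min.split.size.per.rack=134217728;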

Answer 1:

Your query surely has to deal with both the S3N protocol (listing the files in S3) and decompressing the gzip files. Try using s3distcp to copy the files from S3 to HDFS first, which it does much faster, and then create a table over the copied files.
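
Roughly something like the following, run as an EMR step or from the master node before the Hive job (the paths, group pattern, and sizes are placeholders, not something I have tested against your data):

-- copy (and merge) one day's worth of files from S3 into HDFS, e.g.:
--   s3-dist-cp --src s3n://my-tracking-bucket/logs/2014-05-19/ \
--              --dest hdfs:///data/click/dt=2014-05-19/ \
--              --groupBy '.*(\.gz)' --targetSize 256
-- (--groupBy concatenates the small files; --targetSize is in MB)

-- then point a table at the copied data instead of at S3
-- (same columns and row format as the original click table):
CREATE EXTERNAL TABLE click_hdfs (
    clickId string
    -- ... remaining log fields ...
)
PARTITIONED BY (dt string)
LOCATION 'hdfs:///data/click/';

ALTER TABLE click_hdfs ADD IF NOT EXISTS PARTITION (dt='2014-05-19')
    LOCATION 'hdfs:///data/click/dt=2014-05-19/';

With the data local to the cluster and far fewer, larger files to scan, the INSERT OVERWRITE into the SequenceFile table should spend its time on I/O inside the cluster rather than on S3 listing and per-file overhead.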