Performance tuning for Amazon EMR / Hive processing

Published 2019-07-23 08:54

I am trying to use Amazon EMR with Hive to process a rather large number of log files generated by ad tracking servers. The performance is far worse than I would expect, and I am hoping someone can give me pointers for improvement.

The tracking servers upload log files every few minutes to S3 folders partitioned by day (e.g., "2014-05-20"). Roughly 3,000 files are uploaded per day in total, at roughly 20k per file.

Using Hive, I have successfully created external tables referencing the data in S3, and set up partitions for 30 days' worth of log files. I have verified that the partitioning is working correctly, and simple queries (e.g., "SELECT * FROM click WHERE dt='2014-05-19' LIMIT 10") work correctly and respond quickly.
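For reference, the external table and partitions are defined roughly like this (the bucket name, row format, and column list below are placeholders, not the real schema):

CREATE EXTERNAL TABLE click (
    clickId string
    -- ...
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-log-bucket/clicks/';

-- one of 30 such statements, one per day folder
ALTER TABLE click ADD IF NOT EXISTS PARTITION (dt='2014-05-19')
    LOCATION 's3n://my-log-bucket/clicks/2014-05-19/';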

I am loading the data into temporary HDFS tables for subsequent queries. To do so, I run an HQL job that is essentially this (note that click is the external table in S3):

CREATE TABLE tmp_click (
    clickId string,
    -- ...
    dt string
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE tmp_click
    SELECT
        clickId,
        -- ...
        k.dt
    FROM
        click k
    WHERE
        k.dt >= '${START_DAY}' AND
        k.dt <=  '${END_DAY}'
;

This operation takes upwards of an hour with 25 xlarge instances working as core/task nodes. Given that there is basically no processing going on here -- it's just copying the data over, right? -- I feel like there must be something I'm missing. Can anyone give me any tips to investigate?

I've considered that perhaps the large number of files (~3,000 per day), or the gzip compression of the log files, might be the problem, but I have no control over the input.
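If the file count is the issue, would forcing Hive to combine the small gzip files into fewer map splits help? Something along these lines is what I had in mind (I'm not at all sure these are the right knobs for the Hive version that EMR runs):

-- illustrative only; setting names are from MR1-era Hive
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- reportedly needed before compressed (gzip) inputs will be combined
SET hive.hadoop.supports.splittable.combineinputformat=true;
-- aim for roughly 256 MB of input per combined split
SET mapred.max.split.size=268435456;
SET mapred.min.split.size.per.node=134217728;
SET mapred.min.split.size.per.rack=134217728;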

1 Answer
家丑人穷心不美
#2 · 2019-07-23 09:10

Your query surely has to deal both with listing the files in S3 over the S3N protocol and with handling the compression. Try using s3distcp to copy the files from S3 to HDFS faster, and then create a table over the copied files.
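Something along these lines should do it (the paths, regex, and sizes are placeholders, and on older EMR AMIs s3distcp is invoked via hadoop jar rather than the s3-dist-cp command):

# copy one day of logs from S3 to HDFS, concatenating the small .gz
# files into ~128 MB chunks so Hive sees far fewer input files
s3-dist-cp \
    --src s3n://my-log-bucket/clicks/2014-05-19/ \
    --dest hdfs:///data/clicks/2014-05-19/ \
    --groupBy '.*(2014-05-19).*\.gz' \
    --targetSize 128 \
    --outputCodec gz

The --groupBy pattern needs at least one capture group; all matching files are concatenated into larger files named from that group, which is what collapses the thousands of small uploads into a handful of inputs. Then point your tmp_click table (or a LOAD DATA INPATH) at the HDFS destination and run the heavy queries against that.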
