I am trying to use Amazon EMR with Hive to process a rather large number of logfiles generated by ad tracking servers. The performance is far worse than I would expect, and I am hoping someone can give me pointers for improvement.
The tracking servers upload log files every few minutes to S3 folders partitioned by day (e.g., "2014-05-20"). Roughly 3,000 files are uploaded in total per day, each around 20k.
Using Hive, I have successfully created external tables referencing the data in S3 and set up partitions for 30 days' worth of log files. I have verified that the partitioning is working correctly, and simple queries (e.g., "SELECT * FROM click WHERE dt='2014-05-19' LIMIT 10") work correctly and respond quickly.
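For reference, the external table and partitions were set up along these lines (the bucket name and column list below are simplified placeholders, not the real ones, and I've left out the serde details):

CREATE EXTERNAL TABLE click (
    clickId string
    -- ... remaining columns omitted
)
PARTITIONED BY (dt string)
LOCATION 's3://my-log-bucket/click/';  -- placeholder bucket/path

-- one ALTER per day, since the S3 folders are named by date rather than dt=...
ALTER TABLE click ADD IF NOT EXISTS PARTITION (dt='2014-05-19')
LOCATION 's3://my-log-bucket/click/2014-05-19/';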
I am loading the data into temporary HDFS tables for subsequent queries. To do so, I run an HQL job that is essentially this (note that click is the external table in S3):
CREATE TABLE tmp_click (
    clickId string,
    -- ...
    dt string
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE tmp_click
SELECT
    clickId,
    -- ...
    k.dt
FROM
    click k
WHERE
    k.dt >= '${START_DAY}' AND
    k.dt <= '${END_DAY}'
;
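(For completeness: ${START_DAY} and ${END_DAY} are substituted when the step is launched; defining them inline would look roughly like this, with example dates:)

SET hivevar:START_DAY=2014-04-20;
SET hivevar:END_DAY=2014-05-19;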
This operation takes upwards of an hour with 25 xlarge instances working as core/task nodes. Given that there is basically no processing going on here -- it's just copying the data over, right? -- I feel like there must be something I'm missing. Can anyone give me any tips on what to investigate?
I've considered that perhaps the large number of files (~3,000/day) or the gzip compression of the logfiles might be the problem, but I have no ability to control the input.