I'm trying to attack the problem of analyzing web logs with Hive, and I've seen plenty of examples out there, but I can't seem to find anyone with this specific issue.
Here's where I'm at: I've set up an AWS Elastic MapReduce cluster, I can log in, and I fire up Hive. I make sure to run add jar hive-contrib-0.8.1.jar, and it says it's loaded. I create a table called event_log_raw, with a few string columns and a regex. I run load data inpath '/user/hadoop/tmp' overwrite into table event_log_raw, and I'm off to the races. select * from event_log_raw works (I think locally, as I don't get the map % and reduce % outputs), and I get my 10 records from my sample data, parsed correctly, everything's good. select count(*) from event_log_raw works as well, this time with a MapReduce job created.
I want to convert my request_url field to a map, so I run:
select elr.view_time as event_time, elr.ip as ip,
str_to_map(split(elr.request_url," ")[1],"&","=") as params
from event_log_raw elr
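To be clear about what I expect the params column to contain, here's a rough Python sketch of the same split-then-map logic. This is only an illustration: the str_to_map function below is a hypothetical stand-in I wrote, not Hive's implementation (Hive treats the delimiters as regular expressions, this treats them as literal strings), and the sample request_url value is made up, not taken from my real logs.

```python
def str_to_map(text, pair_delim="&", kv_delim="="):
    """Rough stand-in for Hive's str_to_map(text, pair_delim, kv_delim)."""
    result = {}
    for pair in text.split(pair_delim):
        # Split each pair on the first key/value delimiter only.
        key, sep, value = pair.partition(kv_delim)
        # A pair with no delimiter gets a null (None) value.
        result[key] = value if sep else None
    return result


# Hypothetical request_url value; the real log format may differ.
request_url = "GET id=42&action=view HTTP/1.1"

# Same shape as split(elr.request_url, " ")[1] in the query above.
params = str_to_map(request_url.split(" ")[1])
print(params)
```

So for each row I'd expect a map keyed by parameter name, with the parameter values as strings.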
MapReduce fires up, waiting, waiting... FAILED.
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
I check the syslogs from the task trackers and see, among other things,
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
<snip>
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.RegexSerDe
at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:406)
at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:90)
... 22 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.RegexSerDe
I've googled this and searched SO, but I guess my google-fu is not up to snuff. Everything I've found points to folks having trouble with this and solving it by running the add jar command. I've tried that, I've tried adding it to my hive-site.xml, I've tried having it locally, and I've tried putting the jar in an S3 bucket. I also tried adding a bootstrap step to add it during the bootstrap phase (disaster).
Can anyone help me figure out (a) why my task nodes can't find RegexSerDe, and (b) how to make this work? Links are welcome as well, if they might reveal something more than just running add jar.
Thanks in advance!