How do I make sure RegexSerDe is available to my Hadoop nodes?

Posted 2019-06-01 05:26

Question:

I'm trying to attack the problem of analyzing web logs with Hive, and I've seen plenty of examples out there, but I can't seem to find anyone with this specific issue.

Here's where I'm at: I've set up an AWS ElasticMapReduce cluster, I can log in, and I fire up Hive. I make sure to run add jar hive-contrib-0.8.1.jar, and it says it's loaded. I create a table called event_log_raw with a few string columns and a regex, run load data inpath '/user/hadoop/tmp' overwrite into table event_log_raw, and I'm off to the races. select * from event_log_raw works (locally, I think, since I don't get the map % and reduce % outputs), and I get my 10 records from my sample data, parsed correctly; everything's good. select count(*) from event_log_raw works as well, this time with a MapReduce job created.
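
For context, the table DDL was along these lines; the regex and column types here are simplified placeholders, not my exact definition:

add jar hive-contrib-0.8.1.jar;

create table event_log_raw (
    view_time string,
    ip string,
    request_url string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
    "input.regex" = "(\\S+) (\\S+) (\\S+)"
)
stored as textfile;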

I want to convert my request_url field to a map, so I run:

select elr.view_time as event_time, elr.ip as ip, 
str_to_map(split(elr.request_url," ")[1],"&","=") as params 
from event_log_raw elr

MapReduce fires up, waiting, waiting... FAILED.

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL

I check the syslogs from the task trackers and see, among other things,

java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
<snip>
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.RegexSerDe
at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:406)
at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:90)
... 22 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.RegexSerDe

I've Googled and SO'ed this, but I guess my Google-fu is not up to snuff. Everything I've found points to folks having trouble with this and solving it by running the add jar command. I've tried that; I've tried adding it to my hive-site.xml; I've tried keeping the jar locally; I've tried putting it in an S3 bucket. I even tried adding a bootstrap step to install it during the bootstrap phase (a disaster).
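
For what it's worth, the hive-site.xml attempt was setting hive.aux.jars.path, roughly like this (the file:// path is just where my jar happened to live, so adjust as needed):

<property>
    <name>hive.aux.jars.path</name>
    <value>file:///home/hadoop/hive-contrib-0.8.1.jar</value>
</property>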

Can anyone help me figure out a.) why my task nodes can't find RegexSerDe, and b.) how to make this work? Links are welcome as well, if they might reveal something more than just running add jar.

Thanks in advance!

Answer 1:

The easiest way to fix this is to add all of these jars to Hadoop's lib directory on all of the task trackers; we do this for a whole bunch of libraries:

scp library.jar task-tracker-1:~/<HADOOP_HOME>/lib/
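
If you have more than a couple of task trackers, a small loop saves the repetition (the hostnames here are just examples, and <HADOOP_HOME> stands in for your actual install path):

for tt in task-tracker-1 task-tracker-2 task-tracker-3; do
    # substitute your real Hadoop install path for <HADOOP_HOME>
    scp library.jar "$tt":~/<HADOOP_HOME>/lib/
done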

or with EMR in the bootstrap script:

s3cmd get s3://path/to/lib.jar /home/hadoop/lib/

When we used EMR, we just had an S3 directory full of jars that we would sync to the Hadoop lib directory:

s3cmd sync s3://static/jars/ /home/hadoop/jars
cp jars/*.jar lib/
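
Putting that together, the bootstrap action can be a small script along these lines (bucket name and paths are examples, so adjust to your setup):

#!/bin/bash
# Pull the shared jars from S3 and drop them into Hadoop's lib
# directory so every node, including the task trackers, has them
# on the classpath before any jobs run.
s3cmd sync s3://static/jars/ /home/hadoop/jars
cp /home/hadoop/jars/*.jar /home/hadoop/lib/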

If you use Oozie, you could also put the jars in the oozie.share.lib directory.



Answer 2:

I copied the SerDe jar file to the hadoop/lib directory and also restarted Hadoop (or even the whole server) to get it to actually work.



Answer 3:

I guess all you need is to add this jar file to the HIVE_AUX_JARS_PATH variable. For example:

If your hive-contrib-0.8.1.jar is at /usr/hive/lib, then run

export HIVE_AUX_JARS_PATH=/usr/hive/lib/hive-contrib-0.8.1.jar:$HIVE_AUX_JARS_PATH

or, if HIVE_AUX_JARS_PATH does not exist, just run

export HIVE_AUX_JARS_PATH=/usr/hive/lib/hive-contrib-0.8.1.jar

After that, start the Hive session and you will see that everything works fine.

In case you need this variable permanently, put it into your .profile or .bash_profile file, depending on your operating system. A sketch of what that could look like, assuming the same jar location as above:
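
# Make hive-contrib visible to Hive in every login shell.
# Append to any existing value so other aux jars are preserved.
if [ -z "$HIVE_AUX_JARS_PATH" ]; then
    export HIVE_AUX_JARS_PATH=/usr/hive/lib/hive-contrib-0.8.1.jar
else
    export HIVE_AUX_JARS_PATH=/usr/hive/lib/hive-contrib-0.8.1.jar:$HIVE_AUX_JARS_PATH
fi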