I'm trying to write a UDF for Hadoop Hive, that parses User Agents. Following code works fine on my local machine, but on Hadoop I'm getting:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String MyUDF .evaluate(java.lang.String) throws org.apache.hadoop.hive.ql.metadata.HiveException on object MyUDF@64ca8bfb of class MyUDF with arguments {All Occupations:java.lang.String} of size 1',
Code:
import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.*;
import com.decibel.uasparser.OnlineUpdater;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;
public class MyUDF extends UDF {
public String evaluate(String i) {
UASparser parser = null;
parser = new UASparser();
String key = "";
OnlineUpdater update = new OnlineUpdater(parser, key);
UserAgentInfo info = null;
info = parser.parse(i);
return info.getDeviceType();
}
}
Facts that come to my mind I should mention:
I'm compiling with Eclipse with "export runnable jar file" and extract required libraries into generated jar option
I'm uploading this "fat jar" file with Hue
Minimum working example I managed to run:
public String evaluate(String i) { return "hello" + i.toString()"; }
I guess the problem lies somewhere around that library (downloaded from https://udger.com) I'm using, but I have no idea where.
Any suggestions?
Thanks, Michal
such a problem probably can be solved by steps:
overide the method
UDF.getRequiredJars()
, make it returning ahdfs
file path list which values are determined by where you put the following xxx_lib folder into your hdfs. Note that , the list mist exactly contains each jar's full hdfs path strings ,such ashdfs://yourcluster/some_path/xxx_lib/some.jar
export your
udf
code by following "Runnable jar file exporting wizard" (chose "copy required libraries into a sub folder next to the generated jar". This steps will result in a xxx.jar and a lib folder xxx_lib next to xxx.jarput xxx.jar and the folders xxx_lib to your hdfs filesystem according to your code in step 0.
create a udf using: add jar ${the-xxx.jar-hdfs-path}; create function your-function as $}qualified name of udf class};
Try it. I test this and it works
It could be a few things. Best thing is to check the logs, but here's a list of a few quick things you can check in a minute.
jar does not contain all dependencies. I am not sure how eclipse builds a runnable jar, but it may not include all dependencies. You can do
jar tf your-udf-jar.jar
to see what was included. You should see stuff from
com.decibel.uasparser
. If not, you have to build the jar with the appropriate dependencies (usually you do that using maven).Different version of the JVM. If you compile with jdk8 and the cluster runs jdk7, it would also fail
Hive version. Sometimes the Hive APIs change slightly, enough to be incompatible. Probably not the case here, but make sure to compile the UDF against the same version of hadoop and hive that you have in the cluster
You should always check if
info
is null after the call toparse()
looks like the library uses a key, meaning that actually gets data from an online service (udger.com), so it may not work without an actual key. Even more important, the library updates online, contacting the online service for each record. This means, looking at the code, that it will create one update thread per record. You should change the code to do that only once in the constructor like the following:
Here's how to change it:
But to know for sure, you have to look at the logs...yarn logs, but also you can look at the hive logs on the machine you're submitting the job on ( probably in /var/log/hive but it depends on your installation).