Hadoop Hive UDF with external library

Posted 2019-08-05 01:27

I'm trying to write a UDF for Hadoop Hive that parses user agents. The following code works fine on my local machine, but on Hadoop I get:

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String MyUDF .evaluate(java.lang.String) throws org.apache.hadoop.hive.ql.metadata.HiveException on object MyUDF@64ca8bfb of class MyUDF with arguments {All Occupations:java.lang.String} of size 1',

Code:

import org.apache.hadoop.hive.ql.exec.UDF;
import com.decibel.uasparser.OnlineUpdater;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;

public class MyUDF extends UDF {

    public String evaluate(String i) {
        UASparser parser = new UASparser();
        String key = "";
        OnlineUpdater update = new OnlineUpdater(parser, key);
        UserAgentInfo info = parser.parse(i);
        return info.getDeviceType();
    }
}

A few facts I should mention:

  • I'm compiling in Eclipse with "Export runnable JAR file" and the "extract required libraries into generated JAR" option

  • I'm uploading this "fat jar" file with Hue

  • The minimal working example I managed to run:

    public String evaluate(String i) { return "hello" + i.toString(); }

  • I guess the problem lies somewhere in the library I'm using (downloaded from https://udger.com), but I have no idea where.

Any suggestions?

Thanks, Michal

2 Answers
欢心 · 2019-08-05 01:44

Such a problem can probably be solved with the following steps:

  1. Override the method UDF.getRequiredJars(), making it return a list of HDFS file paths determined by where you put the xxx_lib folder (from step 2 below) in HDFS; see the sketch after this list. Note that the list must contain each jar's full HDFS path string, such as hdfs://yourcluster/some_path/xxx_lib/some.jar

  2. Export your UDF code with the "Runnable JAR file" export wizard (choose "copy required libraries into a sub-folder next to the generated JAR"). This step produces xxx.jar and a lib folder xxx_lib next to it.

  3. Put xxx.jar and the xxx_lib folder into your HDFS filesystem, at the paths your code from step 1 returns.

  4. Register the UDF with: add jar ${the-xxx.jar-hdfs-path}; create function your-function as '${qualified name of udf class}';
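
A minimal sketch of the getRequiredJars() override from step 1, assuming your Hive version's UDF class exposes that method (the cluster name, HDFS path, and jar file names below are placeholders; list every jar that actually ends up in your xxx_lib folder):

    import org.apache.hadoop.hive.ql.exec.UDF;

    public class MyUDF extends UDF {

        @Override
        public String[] getRequiredJars() {
            // Full HDFS path of every dependency jar uploaded to xxx_lib
            // (placeholder names; use the jars Eclipse actually copied there)
            return new String[] {
                "hdfs://yourcluster/some_path/xxx_lib/some.jar",
                "hdfs://yourcluster/some_path/xxx_lib/another.jar"
            };
        }

        public String evaluate(String i) {
            // ... parsing logic unchanged ...
            return null;
        }
    }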

Try it. I tested this and it works.

Evening l夕情丶 · 2019-08-05 01:52

It could be a few things. The best thing is to check the logs, but here's a list of quick things you can check in a minute.

  1. The jar does not contain all dependencies. I am not sure how Eclipse builds a runnable jar, but it may not include all dependencies. You can run

    jar tf your-udf-jar.jar

to see what was included. You should see classes from com.decibel.uasparser. If not, you have to build the jar with the appropriate dependencies (usually you'd do that with Maven).

  2. A different version of the JVM. If you compile with JDK 8 and the cluster runs JDK 7, it will also fail.

  3. Hive version. Sometimes the Hive APIs change slightly, enough to be incompatible. Probably not the case here, but make sure to compile the UDF against the same versions of Hadoop and Hive that run on the cluster.

  4. You should always check whether info is null after the call to parse().

  5. It looks like the library uses a key, which means it actually fetches data from an online service (udger.com), so it may not work without a real key. Even more important, the library updates itself online, contacting the service for each record. Looking at the code, that means it will create one update thread per record. You should change the code so the update happens only once, in the constructor:

public class MyUDF extends UDF {
    UASparser parser = new UASparser();

    public MyUDF() {
        super();
        String key = "PUT YOUR KEY HERE";
        // update only once, when the UDF is instantiated
        OnlineUpdater update = new OnlineUpdater(parser, key);
    }

    public String evaluate(String i) {
        UserAgentInfo info = parser.parse(i);
        if (info != null) return info.getDeviceType();
        // you want it to return null if it's unparseable,
        // otherwise one bad record will stop your processing
        // with an exception
        else return null;
    }
}
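
If you want to be extra defensive, here is a variant of evaluate() (just a sketch, assuming you would rather skip unparseable records than fail the whole job, and that parse() might throw at runtime on bad input):

    public String evaluate(String i) {
        if (i == null) return null;  // Hive passes SQL NULLs as Java nulls
        try {
            UserAgentInfo info = parser.parse(i);
            return info == null ? null : info.getDeviceType();
        } catch (Exception e) {
            // swallow parse errors so one bad record doesn't kill the job
            return null;
        }
    }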

But to know for sure, you have to look at the logs: the YARN logs, but also the Hive logs on the machine you're submitting the job from (probably in /var/log/hive, but it depends on your installation).
