Hive UDF Text to array

Posted 2019-02-07 13:35

I'm trying to write a UDF for Hive that offers more functionality than the built-in split() function.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LowerCase extends UDF {

  public Text evaluate(final Text text) {
    return new Text(stemWord(text.toString()));
  }

  /**
   * Stems words to normal form.
   * 
   * @param word
   * @return Stemmed word.
   */
  private String stemWord(String word) {
    word = word.toLowerCase();
    // Remove special characters
    // Porter stemmer
    // ...
    return word;
  }
}

This works in Hive. I export the class to a jar file and then load it into Hive with

add jar /path/to/myJar.jar;

and create a function using

create temporary function lower_case as 'LowerCase';

I have a table with a string column, so the query is:

select lower_case(text) from documents;

Now I want to create a function that returns an array, as split() does.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class WordSplit extends UDF {

  public Text[] evaluate(final Text text) {
    List<Text> splitList = new ArrayList<>();

    StringTokenizer tokenizer = new StringTokenizer(text.toString());

    while (tokenizer.hasMoreElements()) {
      Text word = new Text(stemWord((String) tokenizer.nextElement()));

      splitList.add(word);
    }

    return splitList.toArray(new Text[splitList.size()]);
  }

  /**
   * Stems words to normal form.
   * 
   * @param word
   * @return Stemmed word.
   */
  private String stemWord(String word) {
    word = word.toLowerCase();
    // Remove special characters
    // Porter stemmer
    // ...
    return word;
  }
}

Unfortunately, this function does not work when I follow the same loading procedure as above. I get the following error:

FAILED: SemanticException java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<>' but '>' is found.

As I haven't found any documentation mentioning this kind of transformation, I'm hoping that you will have some advice for me!

2 answers
趁早两清
#2 · 2019-02-07 13:50

I don't think the 'UDF' class will provide what you want. You probably want to use GenericUDF instead. I would use the source of the split UDF as a guide:

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop.hive/hive-exec/0.7.1-cdh3u1/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSplit.java

戒情不戒烟
#3 · 2019-02-07 14:07

Actually, the 'UDF' class does support returning an array.

Return ArrayList<Text> or even ArrayList<String> instead of Text[].
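A plausible explanation for why the list works and the array does not: Hive seems to infer the return type of a plain UDF by reflecting on evaluate(), and an array class such as Text[] looks to reflection like a class with no fields, which would match the empty 'struct<>' in the error. This is an assumption about Hive's internals, not something confirmed in this thread. The sketch below uses only the JDK, with String standing in for Text and hypothetical names (ReturnTypeDemo, returnsArray, returnsList, describe), to show how differently the two return types appear to reflection:

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;

// Hypothetical demo class, not part of Hive. String stands in for Text.
final class ReturnTypeDemo {

  private ReturnTypeDemo() {}

  // Mirrors the shape of the failing evaluate(): returns an array.
  static String[] returnsArray(String s) {
    return new String[] {s};
  }

  // Mirrors the shape of the working evaluate(): returns a parameterized list.
  static ArrayList<String> returnsList(String s) {
    ArrayList<String> list = new ArrayList<>();
    list.add(s);
    return list;
  }

  // Describes what reflection reports for the return type of the named method.
  static String describe(String methodName) {
    Method method;
    try {
      method = ReturnTypeDemo.class.getDeclaredMethod(methodName, String.class);
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException("No such method: " + methodName, e);
    }
    Type type = method.getGenericReturnType();
    if (type instanceof ParameterizedType) {
      // The element type is recoverable from the type argument.
      ParameterizedType p = (ParameterizedType) type;
      Class<?> raw = (Class<?>) p.getRawType();
      Class<?> arg = (Class<?>) p.getActualTypeArguments()[0];
      return "parameterized:" + raw.getSimpleName() + "<" + arg.getSimpleName() + ">";
    }
    if (type instanceof Class && ((Class<?>) type).isArray()) {
      // An array class exposes no declared fields to reflection.
      int fields = ((Class<?>) type).getDeclaredFields().length;
      return "array class with " + fields + " declared fields";
    }
    return "other";
  }
}
```

Here describe("returnsList") yields "parameterized:ArrayList<String>", while describe("returnsArray") yields "array class with 0 declared fields". A type resolver that only special-cases List and Map and treats every other class as a struct would therefore end up with a struct that has no members.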

Your code should look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class WordSplit extends UDF {

  public ArrayList<String> evaluate(final Text text) {
    ArrayList<String> splitList = new ArrayList<String>();

    StringTokenizer tokenizer = new StringTokenizer(text.toString());

    while (tokenizer.hasMoreTokens()) {
      String word = stemWord(tokenizer.nextToken());
      splitList.add(word);
    }
    return splitList;
  }

  /**
   * Stems words to normal form.
   *
   * @param word
   * @return Stemmed word.
   */
  private String stemWord(String word) {
    word = word.toLowerCase();
    return word;
  }
}
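
One thing the code above does not handle is a NULL column value, which Hive passes to evaluate() as a Java null, so text.toString() would throw a NullPointerException. A minimal sketch of the tokenizing logic with a null guard follows. It is kept free of Hive classes so it can be tried without a Hive session; the class and method names (WordSplitLogic, splitAndStem) are hypothetical, and the stemming is still just lower-casing as in the original.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper, not part of Hive: the UDF's evaluate() could delegate to it.
final class WordSplitLogic {

  private WordSplitLogic() {}

  // Splits on whitespace and normalizes each token.
  // Returns null for null input so that a NULL column stays NULL.
  static List<String> splitAndStem(String text) {
    if (text == null) {
      return null;
    }
    List<String> words = new ArrayList<>();
    StringTokenizer tokenizer = new StringTokenizer(text);
    while (tokenizer.hasMoreTokens()) {
      words.add(stemWord(tokenizer.nextToken()));
    }
    return words;
  }

  // Placeholder for the real stemmer; only lower-cases, as in the question.
  static String stemWord(String word) {
    return word.toLowerCase(Locale.ROOT);
  }
}
```

With this in place, evaluate() could check text for null and otherwise wrap the result of WordSplitLogic.splitAndStem(text.toString()) in an ArrayList<String>. For example, splitAndStem("Hello  World\tFOO") gives [hello, world, foo], and an empty string gives an empty list.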