I'm trying to create some UDF for Hive which is giving me some more functionality than the already provided split()
function.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class LowerCase extends UDF {
public Text evaluate(final Text text) {
return new Text(stemWord(text.toString()));
}
/**
* Stems words to normal form.
*
* @param word
* @return Stemmed word.
*/
private String stemWord(String word) {
word = word.toLowerCase();
// Remove special characters
// Porter stemmer
// ...
return word;
}
}
This is working in Hive. I export this class into a jar file. Then I load it into Hive with
add jar /path/to/myJar.jar;
and create a function using
create temporary function lower_case as 'LowerCase';
I've got a table with a String field in it. The statement is then:
select lower_case(text) from documents;
But now I want to create a function returning an array (as e.g. split does).
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class WordSplit extends UDF {
public Text[] evaluate(final Text text) {
List<Text> splitList = new ArrayList<>();
StringTokenizer tokenizer = new StringTokenizer(text.toString());
while (tokenizer.hasMoreElements()) {
Text word = new Text(stemWord((String) tokenizer.nextElement()));
splitList.add(word);
}
return splitList.toArray(new Text[splitList.size()]);
}
/**
* Stems words to normal form.
*
* @param word
* @return Stemmed word.
*/
private String stemWord(String word) {
word = word.toLowerCase();
// Remove special characters
// Porter stemmer
// ...
return word;
}
}
Unfortunately this function does not work if I do the exact same loading procedure mentioned above. I'm getting the following error:
FAILED: SemanticException java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<>' but '>' is found.
As I haven't found any documentation mentioning this kind of transformation, I'm hoping that you will have some advice for me!
I don't think 'UDF' interface will provide what you want. You want to use GenericUDF. I would use the source of the split UDF as a guide.
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop.hive/hive-exec/0.7.1-cdh3u1/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSplit.java
Actually the 'UDF' interface does support returning an array.
Return
ArrayList<Text>
or evenArrayList<String>
instead ofText[]
Your code should look like this: