Hive RegexSerDe

2019-06-14 08:04发布

问题:

I need to read data from a flat file. It contains a number of lines but want to extract data from the line that looks like:

REVISION 12 30364918 Anarchism 2005-12-06T17:44:47Z RJII 141644

I only want the 2nd, 3rd and 5th entries on this line and put them into a Hive table; I have issued this command but get an error

create external table testTable (
tag string, 
a string, 
r string
) 
row format SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
"input.regex" = "REVISION\s,[0-9]*,\s,[0-9]*,\s[a-zA-Z0-9]*\s,[0-9]*-[0-9]*-[0-9]*T[0-9]*:[0-9]*:[0-9]*Z",
"output.format.string" = "%1$s %2$s %3$s") 
stored as textfile 
location 'hdfs://location:8020/user/bd4-project1/enwiki-20080103-sample';

It doesnt seem to work and keeps giving an exception. Any ideas? The regex could be wrong, but i just have no idea

I can post the exception later, dont have access to the cluster at the moment

回答1:

I have test this using Hive 0.10.0, it should work for you.

create table ts_test2(
  tag string, 
  a string, 
  r string
) 
row format SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
"input.regex" = "REVISION ([0-9]+) ([0-9]+) [a-zA-Z0-9]+ ([0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+[Z]) RJII [0-9]+$",
"output.format.string" = "%1$s %2$s %3$s");  

Some notes:
1. make sure your regex is exactly correct to macth all the row, or you will get NULL in your hive table. At least test the regex somewhere like this
2. using () to wrap the field you are interested in.
3. I am using space, you can change it to \s (or maybe \\s).