Apache Hive regEx serde: data types

Published 2019-03-31 04:00

Question:

For processing logs I want to use the Apache Hive RegexSerDe, but I have only found examples that use STRING as the data type for the table's columns.

My question is: are date-based types, integers, and arrays supported, or only strings?

This example (and others) only uses strings:

CREATE TABLE access_log (
  remote_ip STRING,
  request_date STRING,
  method STRING,
  request STRING,
  protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) . . \\[([^\\]]+)\\] \"([^ ]*) ([^ ]*) ([^ \"]*)\" *",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
)
STORED AS TEXTFILE;

Answer 1:

  • See the source of the SerDe (code of RegexSerDe, or the RegexSerDe code on GitHub). A comment in the code states: "All columns have to be of type STRING."
  • If you want to change that behaviour and are comfortable with Java, write a custom SerDe and add it as a custom SerDe jar, as in this example CSV custom SerDe.
  • Otherwise, keep the column types as STRING and convert a column with Hive's cast() function in the query whenever you need another type.
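
The casting approach from the last bullet might look like the sketch below. It runs against the access_log table from the question. The date pattern 'dd/MMM/yyyy:HH:mm:ss Z' is an assumption about how request_date is written in the logs, and request_ts is just an illustrative alias.

```sql
-- Sketch only: all columns stay STRING in the table and are converted at query time.
-- Assumption: request_date looks like 10/Oct/2000:13:55:36 -0700.
SELECT
  remote_ip,
  -- unix_timestamp parses the string with the given pattern,
  -- from_unixtime renders it as 'yyyy-MM-dd HH:mm:ss', which CAST accepts.
  CAST(from_unixtime(unix_timestamp(request_date, 'dd/MMM/yyyy:HH:mm:ss Z')) AS TIMESTAMP) AS request_ts,
  method,
  request,
  protocol
FROM access_log;
```

In Hive, a CAST that cannot convert its input returns NULL, so malformed values show up as NULLs in the result instead of failing the query.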

hope this helps :)



Answer 2:

I haven't used the RegexSerDe personally, but I do notice that there are two classes for it:

  • serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java
  • contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java

The second one, which you are referring to, does indeed appear to be restricted to strings. The other appears to support primitive types.

For whatever reason I only see the second one referenced in the API docs.
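
If the serde2 class does accept primitive types, as it appears to, a table definition could look roughly like the sketch below. The table name access_log_typed, the status column, and the regex are hypothetical and not taken from the question, and which primitive types are accepted may depend on the Hive version, so check the class source for your release.

```sql
-- Sketch only: assumes the built-in org.apache.hadoop.hive.serde2.RegexSerDe
-- in your Hive version accepts non-string primitive columns.
-- Assumed input line: 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326
CREATE TABLE access_log_typed (
  remote_ip    STRING,
  request_date STRING,  -- left as STRING; this date layout is not Hive's default timestamp format
  request      STRING,
  status       INT      -- hypothetical numeric column, converted from the matched text
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group per column; the trailing .* lets the pattern cover the rest of the line.
  "input.regex" = "([^ ]*) [^ ]* [^ ]* \\[([^\\]]+)\\] \"([^\"]*)\" ([0-9]+).*"
)
STORED AS TEXTFILE;
```

The sketch leaves out output.format.string because it only reads data. Arrays are not primitive types, so they would presumably still need a custom SerDe or post-processing in the query.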