Getting Null Values in Hive Create & Load Query wi

Posted 2019-05-07 13:51

I have a log file whose data I need to load into Hive using a regex. I tried the query below, but every column loads as NULL. I checked the regex at http://www.regexr.com/ and it works fine for my data.

CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)                        
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (                                             
"input.regex" = "(IMEI\\s\\d{15} (\\b(\\d{15})([A-Z0-9]+)) )",          
"output.format.string" = "%1$s %2$s"                              
)
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs:/user/user1/data' OVERWRITE INTO TABLE avl;

Please correct me here.

Sample Log:

[INFO_|01/31 07:19:29]  IMEI 356307043180842 
[INFO_|01/31 07:19:33]  PacketLength = 372
[INFO_|01/31 07:19:33]  Recv HEXString

Thanks.

1 answer
祖国的老花朵
Reply #2 · 2019-05-07 14:21

With your current table definition, no regex will do what you're looking for. The reason is that your file_format is set to TEXTFILE, which splits up the input file by line (\r, \n, or \r\n), before the data ever gets to the SerDe.

Each line is then passed individually to RegexSerDe and matched against your regex; any line that does not match comes back as a row of NULLs. For this reason, multiline regexes will not work with STORED AS TEXTFILE. This is also why you received all NULL rows: no single line of the input matched your entire regex.
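To see the effect outside Hive, here is a rough Python simulation of that per-line matching. It is only a sketch: it assumes RegexSerDe requires the whole line to match (modeled with re.fullmatch), it undoes the Hive string escaping of the pattern by hand, and the deserialize helper is a made-up name, not part of Hive.

```python
import re

# The asker's pattern with Hive's string-literal escaping (\\s, \\d) undone.
PATTERN = re.compile(r"(IMEI\s\d{15} (\b(\d{15})([A-Z0-9]+)) )")

# The three sample log lines from the question, one record per line,
# which is how STORED AS TEXTFILE hands them to the SerDe.
SAMPLE_LINES = [
    "[INFO_|01/31 07:19:29]  IMEI 356307043180842 ",
    "[INFO_|01/31 07:19:33]  PacketLength = 372",
    "[INFO_|01/31 07:19:33]  Recv HEXString",
]

def deserialize(line):
    """Return the capture groups if the whole line matches, else None (a NULL row)."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None

# Hypothetical stand-in for what the table scan would produce.
rows = [deserialize(line) for line in SAMPLE_LINES]
print(rows)  # [None, None, None]
```

None of the three lines satisfies the pattern on its own, so every row comes out empty, which mirrors the all-NULL table.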

One solution here might be pre-processing the data such that each record is only on one line in the input file, but that's not what you're asking for.
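If you did want to go the pre-processing route, a minimal Python sketch could look like the following. It rests on an assumption the question does not confirm, namely that every record starts at a line containing IMEI; join_records and the starts_record callback are hypothetical names for illustration.

```python
def join_records(lines, starts_record):
    """Collapse each multi-line record into a single space-joined line."""
    records, current = [], []
    for line in lines:
        # A new record begins: flush the lines collected so far.
        if starts_record(line) and current:
            records.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        records.append(" ".join(current))
    return records

# Sample lines from the question.
sample = [
    "[INFO_|01/31 07:19:29]  IMEI 356307043180842 ",
    "[INFO_|01/31 07:19:33]  PacketLength = 372",
    "[INFO_|01/31 07:19:33]  Recv HEXString",
]

# ASSUMPTION: a line containing "IMEI" opens a new record.
one_line_records = join_records(sample, lambda line: "IMEI" in line)
print(len(one_line_records))  # 1
```

The regex would then have to be rewritten to match the joined line rather than the original multi-line layout.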

The way to do this in Hive is to use a different file_format:

STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'

TextInputFormat reads a variable named textinputformat.record.delimiter from the current configuration. When you use TextInputFormat, this variable tells Hadoop and Hive where one record ends and the next one begins.

Consequently, setting this value to something like EOR would mean that the input files are split on EOR, rather than by line. Each chunk generated by the split would then get passed to RegexSerDe as a whole chunk, newlines & all.
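The difference between the two splitting modes can be pictured with a small Python sketch. This only imitates the idea and is not how Hadoop implements it; the input text and the split_records helper are invented for illustration, and EOR is just the placeholder delimiter used above.

```python
def split_records(text, delimiter):
    """Split raw file text into records on `delimiter`, dropping one trailing empty chunk."""
    chunks = text.split(delimiter)
    if chunks and chunks[-1] == "":
        chunks.pop()
    return chunks

# Hypothetical file contents: two records, each spanning two lines and ending in EOR.
raw = "rec1 line1\nrec1 line2\nEORrec2 line1\nrec2 line2\nEOR"

by_line = split_records(raw, "\n")     # default behaviour: one record per line
by_marker = split_records(raw, "EOR")  # custom delimiter: one record per EOR-terminated chunk

print(len(by_line))  # 5
print(by_marker)     # ['rec1 line1\nrec1 line2\n', 'rec2 line1\nrec2 line2\n']
```

With the custom delimiter, each chunk still contains its newlines, so the regex has to account for them.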

You can set this variable in a number of places, but if the delimiter should apply only to this query (and to later queries in the same session), you can do:

SET textinputformat.record.delimiter=EOR;

CREATE EXTERNAL TABLE ...
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = ...
   "output.format.string" = ...
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION ...;

In your specific scenario, I can't tell what you might use for textinputformat.record.delimiter instead of EOR, since we were given only one example record, and I can't tell from your regex which field you're trying to capture second.

If you can provide these two items (sample data with >1 records, and what you're trying to capture for packet), I might be able to help out more. As it stands now, your regex does not match the sample data you provided -- not even on the site you linked.
