Getting Null Values in Hive Create & Load Query wi

I have a log file whose data I need to load into Hive using a regex. I tried the query below, but every column loads as NULL. I checked the regex at http://www.regexr.com/ and it works fine on my data.

CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)                        
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (                                             
"input.regex" = "(IMEI\\s\\d{15} (\\b(\\d{15})([A-Z0-9]+)) )",          
"output.format.string" = "%1$s %2$s"                              
)
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs:/user/user1/data' OVERWRITE INTO TABLE avl;

Please correct me here.

Sample Log:

[INFO_|01/31 07:19:29]  IMEI 356307043180842 
[INFO_|01/31 07:19:33]  PacketLength = 372
[INFO_|01/31 07:19:33]  Recv HEXString : 0000000000000168080700000143E5FC86B6002F20BC400C93C6F000FF000E0600280007020101F001040914B34238DD180028CD6B7801C7000000690000000143E5FC633E002F20B3000C93A3B00105000D06002C0007020101F001040915E64238E618002CCD6B7801C7000000640000000143E5FC43FE002F20AA800C9381700109000F06002D0007020101F001040915BF4238D318002DCD6B7801C70000006C0000000143E5FC20D6002F20A1400C935BF00111000D0600270007020101F001040916394238B6180027CD6B7801C70000006D0000000143E5FBF5DE002F2098400C9336500118000B0600260007020101F0010409174D42384D180026CD6B7801C70000006E0000000143E5FBD2B6002F208F400C931140011C000D06002B0007020101F001040915624238C018002BCD6B7801C70000006F0000000143E5FBAF8E002F2085800C92EB10011E000D06002B0007020101F0010409154C4238A318002BCD6B7801C700000067000700005873

Thanks.

Tags: regex hadoop null hive

1 answer

祖国的老花朵

#2 · 2019-05-07 14:21

With your current table definition, no regex will do what you're looking for. The reason is that your file_format is set to TEXTFILE, which splits up the input file by line (\r, \n, or \r\n), before the data ever gets to the SerDe.

Each line is then passed individually to RegexSerDe and matched against your regex; any line that does not match produces a row of NULLs. For this reason, multiline regexes will not work with STORED AS TEXTFILE. This is also why you received all NULL rows: no single line of the input matched your entire regex.
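The per-line behavior can be reproduced outside Hive. The sketch below is in Python rather than Java, so it only approximates RegexSerDe. It assumes the SerDe requires the whole line to match (modeled here with re.fullmatch) and uses a shortened copy of the hex string from the sample log.

```python
import re

# The asker's input.regex after Hive unescapes the doubled backslashes.
INPUT_REGEX = re.compile(r"(IMEI\s\d{15} (\b(\d{15})([A-Z0-9]+)) )")

# Sample log from the question; the hex payload is shortened for readability.
SAMPLE_LOG = (
    "[INFO_|01/31 07:19:29]  IMEI 356307043180842 \n"
    "[INFO_|01/31 07:19:33]  PacketLength = 372\n"
    "[INFO_|01/31 07:19:33]  Recv HEXString : 0000000000000168080700000143E5FC86B6\n"
)

def deserialize_by_line(text):
    """Return one row per line: the captured groups, or None if the line does not match."""
    rows = []
    for line in text.splitlines():
        match = INPUT_REGEX.fullmatch(line)
        rows.append(match.groups() if match else None)
    return rows

rows = deserialize_by_line(SAMPLE_LOG)
print(rows)
```

Every entry comes back as None, which mirrors the all-NULL table: the pattern never gets to see more than one line at a time.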

One solution here might be pre-processing the data such that each record is only on one line in the input file, but that's not what you're asking for.
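For completeness, a minimal sketch of such a pre-processing step in Python. It assumes that every record begins with a line containing "IMEI" followed by a 15-digit number, which the single sample record suggests but does not confirm. The function name flatten_records and the tab separator are illustrative choices only.

```python
import re

# Assumption: a line containing "IMEI <15 digits>" opens a new record.
RECORD_START = re.compile(r"IMEI\s\d{15}")

def flatten_records(lines, separator="\t"):
    """Join the lines of each record into one line so a per-line SerDe sees a whole record."""
    records = []
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if RECORD_START.search(line) and current:
            # A new record starts here: flush the one collected so far.
            records.append(separator.join(current))
            current = []
        current.append(line)
    if current:
        records.append(separator.join(current))
    return records

sample = [
    "[INFO_|01/31 07:19:29]  IMEI 356307043180842 ",
    "[INFO_|01/31 07:19:33]  PacketLength = 372",
    "[INFO_|01/31 07:19:33]  Recv HEXString : 0000000000000168",
]
flat = flatten_records(sample)
print(flat)
```

With the data flattened this way, the original STORED AS TEXTFILE definition could stay, provided the regex is rewritten to match the joined line.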

The way to do this in Hive is to use a different file_format:

STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'

TextInputFormat reads a variable named textinputformat.record.delimiter from the current configuration. If you're using TextInputFormat, this variable tells Hadoop and Hive where one record ends and the next one begins.

Consequently, setting this value to something like EOR would mean that the input files are split on EOR, rather than by line. Each chunk generated by the split would then get passed to RegexSerDe as a whole chunk, newlines & all.
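To illustrate the effect of a custom record delimiter, here is a rough Python simulation. Everything specific in it is assumed: the log in the question does not actually contain EOR markers, the second record is invented, and the pattern is a hypothetical one that captures the IMEI and the hex string. Python's re module with DOTALL stands in for the Java regex engine used by RegexSerDe.

```python
import re

DELIMITER = "EOR"  # stand-in for textinputformat.record.delimiter

# Hypothetical pattern: group 1 is the IMEI, group 2 the hex payload.
# DOTALL lets "." cross the newlines kept inside each chunk.
RECORD_REGEX = re.compile(
    r".*IMEI\s(\d{15}).*Recv HEXString : ([0-9A-F]+)\s*", re.DOTALL
)

# Invented input: two records, each terminated by the assumed EOR marker.
LOG = (
    "[INFO_|01/31 07:19:29]  IMEI 356307043180842 \n"
    "[INFO_|01/31 07:19:33]  PacketLength = 372\n"
    "[INFO_|01/31 07:19:33]  Recv HEXString : 0000000000000168\n"
    "EOR\n"
    "[INFO_|01/31 07:20:01]  IMEI 356307043180843 \n"
    "[INFO_|01/31 07:20:05]  PacketLength = 8\n"
    "[INFO_|01/31 07:20:05]  Recv HEXString : 00000000000000AB\n"
    "EOR\n"
)

def split_records(text, delimiter=DELIMITER):
    """Split on the delimiter instead of on newlines.

    Whitespace-only chunks are dropped, which is a simplification of this sketch.
    """
    return [chunk for chunk in text.split(delimiter) if chunk.strip()]

def deserialize(chunk):
    """Match one whole chunk; return (imei, packet) or (None, None)."""
    match = RECORD_REGEX.fullmatch(chunk)
    return match.groups() if match else (None, None)

parsed = [deserialize(chunk) for chunk in split_records(LOG)]
print(parsed)
```

Each chunk now spans three log lines, so a single pattern can capture fields that sit on different lines of the same record.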

You can set this variable in a number of places, but if the delimiter should apply only to this query (and to subsequent queries in the same session), you can do:

SET textinputformat.record.delimiter=EOR;

CREATE EXTERNAL TABLE ...
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = ...
   "output.regex" = ...
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION ...;

In your specific scenario, I can't tell what you might use for textinputformat.record.delimiter instead of EOR, since we were given only one example record, and I can't tell from your regex which field you're trying to capture second.

If you can provide these two items (sample data with more than one record, and what you're trying to capture for packet), I might be able to help out more. As it stands, your regex does not match the sample data you provided, not even on the site you linked.
