I have a log file whose data I need to load into a Hive table using a regex. I tried the query below, but it loads all NULL values. I have checked the regex with http://www.regexr.com/ and it works fine for my data.
CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(IMEI\\s\\d{15} (\\b(\\d{15})([A-Z0-9]+)) )",
"output.format.string" = "%1$s %2$s"
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs:/user/user1/data' OVERWRITE INTO TABLE avl;
Please correct me here.
Sample Log:
[INFO_|01/31 07:19:29] IMEI 356307043180842
[INFO_|01/31 07:19:33] PacketLength = 372
[INFO_|01/31 07:19:33] Recv HEXString : 0000000000000168080700000143E5FC86B6002F20BC400C93C6F000FF000E0600280007020101F001040914B34238DD180028CD6B7801C7000000690000000143E5FC633E002F20B3000C93A3B00105000D06002C0007020101F001040915E64238E618002CCD6B7801C7000000640000000143E5FC43FE002F20AA800C9381700109000F06002D0007020101F001040915BF4238D318002DCD6B7801C70000006C0000000143E5FC20D6002F20A1400C935BF00111000D0600270007020101F001040916394238B6180027CD6B7801C70000006D0000000143E5FBF5DE002F2098400C9336500118000B0600260007020101F0010409174D42384D180026CD6B7801C70000006E0000000143E5FBD2B6002F208F400C931140011C000D06002B0007020101F001040915624238C018002BCD6B7801C70000006F0000000143E5FBAF8E002F2085800C92EB10011E000D06002B0007020101F0010409154C4238A318002BCD6B7801C700000067000700005873
Thanks.
With your current table definition, no regex will do what you're looking for. The reason is that your file_format is set to TEXTFILE, which splits up the input file by line (\r, \n, or \r\n) before the data ever gets to the SerDe. Each line is then individually passed to RegexSerDe, matched against your regex, and any non-matches return NULL. For this reason, multiline regexes will not work using STORED AS TEXTFILE. This is also why you received all NULL rows: no single line of the input matched your entire regex.
One solution here might be pre-processing the data such that each record is only on one line in the input file, but that's not what you're asking for.
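To make the line-by-line behavior concrete, here is a minimal sketch of a table whose regex is written to match a single line only. The table name avl_imei and the regex are illustrative assumptions, not something from your setup. As far as I know, the contrib RegexSerDe requires the pattern to match the whole line, so the leading .* is there to absorb the timestamp prefix.

```sql
-- Hypothetical table: captures only the 15-digit IMEI from lines like
--   [INFO_|01/31 07:19:29] IMEI 356307043180842
-- Lines that do not match the pattern (PacketLength, Recv HEXString, ...)
-- come back as NULL rows, because each line is matched on its own.
CREATE EXTERNAL TABLE IF NOT EXISTS avl_imei (imei STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group for the one column; the regex must cover the full line.
  "input.regex" = ".*IMEI\\s(\\d{15})\\s*"
)
STORED AS TEXTFILE;
```

With the sample log above, I would expect only the first line to produce a non-NULL imei. That also shows the limit of this approach: the packet on the third line can never land in the same row.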
The way to do this in Hive is to use a different file_format:
TextInputFormat reads a configuration variable named textinputformat.record.delimiter from the current configuration. If you're using TextInputFormat, this variable tells Hadoop and Hive where one record ends and the next one begins.
Consequently, setting this value to something like EOR would mean that the input files are split on EOR, rather than by line. Each chunk generated by the split would then get passed to RegexSerDe as a whole chunk, newlines and all.
You can set this variable in a number of places, but if this is the delimiter for only this query (and subsequent ones within the session), then you can set it with a SET statement in that session.
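A minimal sketch of that session-level setting, assuming your records really are terminated by a literal EOR marker (a placeholder here, since the sample log contains no such marker) and that your Hadoop and Hive versions honor this property:

```sql
-- Assumption: every record in the input ends with the literal marker EOR.
SET textinputformat.record.delimiter=EOR;

-- Queries run later in the same session read avl in EOR-delimited chunks,
-- so RegexSerDe sees each multi-line record as a single string.
SELECT imei, packet FROM avl LIMIT 10;
```

The setting lasts only for the current session, so it has to be issued again in any new session that queries the table.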
In your specific scenario, I can't tell what you might use for textinputformat.record.delimiter instead of EOR, since we were only given one example record, and I can't tell which field you're trying to capture second based on your regex.
If you can provide these two items (sample data with more than one record, and what you're trying to capture for packet), I might be able to help out more. As it stands now, your regex does not match the sample data you provided, not even on the site you linked.