how to regex apache log date and time into hive

2019-06-04 23:26发布

问题:

i want to put my logfiles into a hive (amazon Athena)

my regex is ok, says the tester: https://regex101.com/r/hF4fP8/11

my create table is this:

CREATE EXTERNAL TABLE IF NOT EXISTS webservicelogs.Test15 (
         `day` int,
         `month` string,
         `year` int,
         `hour` int,
         `minute` int,
         `second` int,
         `offset` string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '\[(\d{2})\/([a-zA-Z]{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s(\+\d{4})]' ) 
LOCATION 's3://getag-athena/Test/' 
TBLPROPERTIES ('has_encrypted_data'='false')

the create table statement works

if i want to select the table this erros occures

SELECT * FROM "webservicelogs"."test15" limit 10;

Your query has the following error(s):

HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns

the Logfiles i want to parse is like this:

85.239.101.101 - - [07/Jan/2016:01:00:00 +0100] "POST /bpwsortsinfo1-3/services/Ortsinfo?wsdl HTTP/1.1" 200 467 "-" "Axis2" 449/1883 23 BP7 0

回答1:

i have answered by myself and a help from a colleague

all the \s ses have to be escaped with another backslash, better: all the special characters which are escaped have to be double escaped thats a java thing

(.*)\\s(.*)\\s(.*)\\s\\[(\\d{2})\\/([a-zA-Z]{3})\\/(\\d{4}):(\\d{2}):(\\d{2}):(\\d{2})\\s(\\+\\d{4})].*?$