I need to save data in a table (for reporting, stats, etc.) so that a user can search by time, user agent, etc. I have a script that runs every day, reads the Apache log, and inserts it into the database.
Log format:
10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
My regex:
preg_match('/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) (\".*?\") (\".*?\")$/',$log, $matches);
Now when I print:
print_r($matches);
Array
(
[0] => 10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
[1] => 10.1.1.150
[2] => -
[3] => -
[4] => 29/September/2011
[5] => 14:21:49
[6] => -0400
[7] => GET
[8] => /info/
[9] => HTTP/1.1
[10] => 200
[11] => 9955
[12] => "http://www.domain.com/download/"
[13] => "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
)
The referrer comes back with its surrounding quotes: "http://www.domain.com/download/"
and the same happens for the user agent. How can I get rid of these " characters
in the regex? Bonus: is there a quick way to insert the date/time?
Thanks
To parse an Apache
access_log
file in PHP you can use this regex:
To match the Apache
error_log
format, you can use this regex:
It matches lines with or without the client:
If you don't want to capture the double quotes, move them out of the capture groups.
Should become:
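A minimal sketch of that change, applied to the regex and sample line from the question. Only the last two groups differ: the quotes now sit outside the parentheses, so they are matched but not captured. The variable names `$referrer` and `$userAgent` are just illustrative.

```php
<?php
// Sample line from the question.
$log = '10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"';

// Same pattern as in the question, except that the last two groups
// changed from (\".*?\") to "(.*?)": the quotes are outside the capture.
$pattern = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.*?) (\S+)" (\S+) (\S+) "(.*?)" "(.*?)"$/';

$referrer = null;
$userAgent = null;
if (preg_match($pattern, $log, $matches)) {
    $referrer  = $matches[12]; // no surrounding quotes
    $userAgent = $matches[13]; // no surrounding quotes
}
```

With the sample line, `$referrer` holds `http://www.domain.com/download/` and the group numbering stays the same as in the question's output.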
As an alternative, you could just post-process the entries with
trim($str, '"')
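As a rough sketch of that post-processing, the snippet below trims the quotes from a captured field and also touches on the bonus question by turning the captured date, time, and offset into a string suitable for a DATETIME-style column. The column type, the choice to normalise to UTC, and the variable names are assumptions, not something stated in the question.

```php
<?php
// Values as captured by the regex in the question (groups 4, 5, 6 and 12).
$date     = '29/September/2011';
$time     = '14:21:49';
$offset   = '-0400';
$referrer = '"http://www.domain.com/download/"';

// Strip the surrounding double quotes after matching.
$referrer = trim($referrer, '"');

// 'F' parses the textual month ('September' here; stock Apache logs
// normally write a three-letter month such as 'Sep').
// 'O' parses the UTC offset, e.g. -0400.
$dt = DateTime::createFromFormat('d/F/Y H:i:s O', "$date $time $offset");
if ($dt === false) {
    throw new RuntimeException('Unparseable timestamp');
}

// Assumption: timestamps are stored in UTC.
$dt->setTimezone(new DateTimeZone('UTC'));
$sqlDateTime = $dt->format('Y-m-d H:i:s'); // 2011-09-29 18:21:49
```

Storing UTC avoids mixing offsets in one column; if local time is preferred, the `setTimezone()` call can simply be dropped.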
Having seen and written a lot of erroneous log parsing, here is a hopefully valid regex, tested on 50k lines of logs without a single diff, knowing that:
It's hard to distinguish between the referrer and the user agent. Let's just hope the
" "
between the two is discriminating enough. Yet the infamous
" "
can also appear inside the referrer and inside the user agent, so basically, we're screwed here.
Hope that helps.
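One way to reduce that ambiguity, sketched here and not the regex this answer was tested with: assume the server writes embedded quotes as `\"` (Apache's mod_log_config documentation describes this escaping for `%r`, `%i` and `%o`), and let each quoted field consume escaped characters explicitly. The log line below is hypothetical and only serves to show a `" "` sequence inside the user agent.

```php
<?php
// Hypothetical line: the user agent contains backslash-escaped quotes,
// including an escaped `" "` sequence in the middle.
$log = '10.1.1.150 - - [29/Sep/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "-" "Evil\" \"Bot/1.0"';

// A quoted field: any run of characters that are neither a quote nor a
// backslash, or a backslash followed by any character, between two
// unescaped double quotes.
$quoted = '"((?:[^"\\\\]|\\\\.)*)"';

// client, identd, user, [timestamp], "request", status, bytes,
// "referrer", "user agent"
$pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] ' . $quoted
         . ' (\S+) (\S+) ' . $quoted . ' ' . $quoted . '$/';

$matched = preg_match($pattern, $log, $m);
if ($matched === 1) {
    $referrer  = $m[8]; // -
    $userAgent = $m[9]; // Evil\" \"Bot/1.0 (escapes left as logged)
}
```

This only helps when quotes really are escaped in the log; lines written by a server or format that does not escape them remain ambiguous.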
Your regexes are wrong. You should use a correct regex.
I tried a couple of the regexes here in January 2015 and found that a bad bot's request in my apache2 log does not match.
The bad bot's apache2 line is a Bash hack attempt, and I haven't tried to figure out the regex correction yet: