Question:
I have an Apache access.log file of about 35 GB. Grepping through it is no longer practical because each search takes far too long.
I want to split it into many smaller files, using the date as the splitting criterion.
The date is in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do this using only bash scripting, standard text manipulation programs (grep, awk, sed, and the like), piping, and redirection?
The input file name is access.log. I'd like the output files to be named like access.apache.15_Oct_2011.log (that would do the trick, although it does not sort nicely).
Answer 1:
One way using awk:
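awk 'BEGIN {
    # Map month abbreviations to numbers: m["Jan"] = 1, ..., m["Dec"] = 12
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    # $4 looks like "[15/Oct/2011:12:02:02"
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])
    # Parentheses keep the concatenated file name portable across awk implementations
    print > (FILENAME "-" year "_" month ".txt")
}' incendiary.ws-2010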
This will output files like:
incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt
Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E31270, while this method took 5 seconds.
Original inspiration: "How to split existing apache logfile by month?"
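To see which pieces of the timestamp the split($4, array, "[:/]") call picks out, here is a minimal sketch run on a single line. The sample log line is made up for illustration and is not from the original post.

```shell
# Hypothetical sample line in Apache combined log format (an assumption, not real data).
line='203.0.113.7 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 512'

# With the default whitespace field separator, $4 is "[15/Oct/2011:12:02:02".
# Splitting it on ":" or "/" puts the day (with its leading bracket) in array[1],
# the month abbreviation in array[2], and the year in array[3].
parts=$(printf '%s\n' "$line" | awk '{
    split($4, array, "[:/]")
    print array[1], array[2], array[3]
}')

echo "$parts"
```

Because array[1] still carries the opening bracket, this variant is convenient for monthly files (which only need array[2] and array[3]) but needs extra cleanup for daily ones.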
Answer 2:
Pure bash, making one pass through the access log:
while read -r; do
    # Skip lines that have no [dd/Mon/yyyy: timestamp
    [[ $REPLY =~ \[(..)/(...)/(....): ]] || continue
    d=${BASH_REMATCH[1]}
    m=${BASH_REMATCH[2]}
    y=${BASH_REMATCH[3]}
    #printf -v fname "access.apache.%s_%s_%s.log" ${BASH_REMATCH[@]:1:3}
    printf -v fname "access.apache.%s_%s_%s.log" "$y" "$m" "$d"
    printf '%s\n' "$REPLY" >> "$fname"
done < access.log
Answer 3:
Perl came to the rescue:
perl -ne 'next unless m@\[(\d{1,2})/(\w{3})/(\d{4}):@; open(my $log, ">>", "access.apache.${3}_${2}_${1}.log") or die $!; print $log $_; close($log);' access.log
Well, it's not exactly a "standard" text manipulation program, but it was made for text manipulation nevertheless.
I've also changed the order of the date fields in the file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.
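One caveat worth checking: with a month abbreviation in the name, files group correctly by year, but months within a year sort alphabetically and not chronologically. A quick sketch with made-up file names (assumptions for illustration only):

```shell
# Hypothetical output file names following the access.apache.yyyy_mon_dd.log pattern.
names='access.apache.2011_Oct_15.log
access.apache.2011_Jan_03.log
access.apache.2011_Apr_20.log'

# Sort in the C locale so the comparison is plain byte order.
sorted=$(printf '%s\n' "$names" | LC_ALL=C sort)
first=$(printf '%s\n' "$sorted" | head -n 1)

echo "$sorted"
```

April sorts ahead of January here, which is why a numeric month in the file name is the safer choice when chronological listing matters.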
Answer 4:
Here is an awk version that outputs lexically sortable log files.
Some efficiency enhancements: everything is done in one pass, fname is regenerated only when the date changes, and the previous fname is closed when switching to a new file (otherwise you might run out of file descriptors).
awk -F"[]/:[]" '
BEGIN {
    m2n["Jan"] =  1; m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
    m2n["May"] =  5; m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
    m2n["Sep"] =  9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
    # $2 = day, $3 = month abbreviation, $4 = year
    if ($4 != pyear || $3 != pmonth || $2 != pday) {
        pyear  = $4
        pmonth = $3
        pday   = $2
        if (fname != "")
            close(fname)
        fname = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
    }
    # Append, so a date that reappears later in the log does not truncate its file
    print >> fname
}' access-log
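Before running this on the full 35 GB file, it can help to try the approach on a tiny sample and confirm that no lines are lost. The following is a minimal sketch under stated assumptions: the three sample lines, the scratch directory, and the name sample.log are made up for illustration, and the awk body is a condensed variant of the script above that appends with >>.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical three-line sample log spanning two days (not real data).
cat > sample.log <<'EOF'
203.0.113.7 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 512
203.0.113.7 - - [15/Oct/2011:23:59:59 +0000] "GET /a HTTP/1.1" 200 128
203.0.113.8 - - [16/Oct/2011:00:00:01 +0000] "GET /b HTTP/1.1" 404 64
EOF

awk -F'[]/:[]' '
BEGIN {
    # Build the month-name to month-number table.
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= 12; i++)
        m2n[names[i]] = i
}
{
    key = $4 "/" $3 "/" $2            # year/month/day of the current line
    if (key != prev) {
        if (fname != "")
            close(fname)              # release the previous file descriptor
        fname = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
        prev = key
    }
    print >> fname
}' sample.log

# List the per-day files, then compare input and output line counts.
ls access_*.log
total_in=$(awk 'END { print NR }' sample.log)
total_out=$(cat access_*.log | awk 'END { print NR }')
echo "lines in: $total_in, lines out: $total_out"
```

If the two counts differ on a real log, some lines probably lack a timestamp in the expected position and ended up in an unexpected file.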
Answer 5:
Kind of ugly, that's bash for you:
for year in 2010 2011 2012; do
    for month in jan feb mar apr may jun jul aug sep oct nov dec; do
        # Apache zero-pads the day, so generate 01..31
        for day in $(seq -w 1 31); do
            # Anchor on "[" and ":" so that day 01 does not also match 11, 21, or 31
            grep -i "\[$day/$month/$year:" access.log > "$day-$month-$year.log"
        done
    done
done
Answer 6:
I combined Theodore's and Thor's solutions to use Thor's efficiency improvement and daily files, while retaining the original support for IPv6 addresses in a combined-format log file.
awk '
BEGIN {
    m2n["Jan"] =  1; m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
    m2n["May"] =  5; m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
    m2n["Sep"] =  9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
    # Split only the timestamp field, so colons in an IPv6 client address do not matter.
    # a[2] = day, a[3] = month abbreviation, a[4] = year
    split($4, a, "[]/:[]")
    if (a[4] != pyear || a[3] != pmonth || a[2] != pday) {
        pyear  = a[4]
        pmonth = a[3]
        pday   = a[2]
        if (fname != "")
            close(fname)
        fname = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
    }
    print >> fname
}' access.log
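To illustrate why splitting only $4 matters for IPv6, here is a minimal sketch comparing the two approaches. Both sample lines are made up for illustration (the addresses come from documentation ranges).

```shell
# Hypothetical sample lines: one IPv4 client and one IPv6 client.
v4='203.0.113.7 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 512'
v6='2001:db8::1 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 512'

# Line-level splitting with -F: fine for IPv4, but the colons in an
# IPv6 address shift every field, so $2..$4 no longer hold the date.
fs_v4=$(printf '%s\n' "$v4" | awk -F'[]/:[]' '{ print $2 "/" $3 "/" $4 }')
fs_v6=$(printf '%s\n' "$v6" | awk -F'[]/:[]' '{ print $2 "/" $3 "/" $4 }')

# Splitting only the whitespace-delimited timestamp field is unaffected.
sp_v6=$(printf '%s\n' "$v6" | awk '{
    split($4, a, "[]/:[]")
    print a[2] "/" a[3] "/" a[4]
}')

echo "-F on IPv4:     $fs_v4"
echo "-F on IPv6:     $fs_v6"
echo "split on IPv6:  $sp_v6"
```

The first and third results are the expected date, while the second contains fragments of the client address.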
Answer 7:
I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.
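#!/usr/bin/awk -f
BEGIN {
    # Map month abbreviations to numbers: m["Jan"] = 1, ..., m["Dec"] = 12
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])
    current = year "-" month
    # Report progress each time a new month starts
    if (last != current)
        print current
    last = current
    # Parentheses keep the concatenated file name portable across awk implementations
    print >> (FILENAME "-" year "-" month ".txt")
}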
Also, I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.