I have a gigabytes-large log file in this format:
2016-02-26 08:06:45 Blah blah blah
I have a log parser which splits the single log file into separate files by date, trimming the date from each original line.
I do want some form of tee so that I can see how far along the process is.
The problem is that this method is mind-numbingly slow. Is there no way to do this quickly in bash? Or will I have to whip up a little C program to do it?
log_file=server.log
log_folder=logs
mkdir $log_folder 2> /dev/null
while read a; do
date=${a:0:10}
echo "${a:11}" | tee -a $log_folder/$date
done < <(cat $log_file)
Try this awk solution. It should be pretty fast, it shows progress, and only one output file is kept open at a time. Lines that don't start with a date are written to the current date's file, so no lines are lost. The initial date defaults to "0000-00-00" in case the log starts with lines that have no date.
Any timing comparison would be much appreciated.
dir=$1
if [[ -z $dir ]]; then
echo >&2 "Usage: $0 outdir <logfile"
echo >&2 "outdir: directory where output files are created"
echo >&2 "logfile: input on stdin to split into output files"
exit 1
fi
mkdir -p "$dir"
echo "output directory \"$dir\""
awk -v dir="$dir" '
BEGIN {
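# Explicit character classes instead of {4}-style intervals, which some older awks do not support.
# Anchored so that only a first field that is entirely a date matches.
datepat="^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$"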
date="0000-00-00"
file=dir"/"date
}
date != $1 && $1 ~ datepat {
if(file) {
close(file)
print ""
}
print $1 ":"
date=$1
file=dir"/"date
}
{
if($1 ~ datepat)
line=substr($0,12)
else
line=$0
print line
print line >file
}
'
head -6 "$dir"/*
Sample input log:
first line without date
2016-02-26 08:06:45 0 Blah blah blah
2016-02-26 09:06:45 1 Blah blah blah
2016-02-27 07:06:45 2 Blah blah blah
2016-02-27 08:06:45 3 Blah blah blah
no date line
blank lines
another no date line
2016-02-28 07:06:45 4 Blah blah blah
2016-02-28 08:06:45 5 Blah blah blah
Output:
first line without date
2016-02-26:
08:06:45 0 Blah blah blah
09:06:45 1 Blah blah blah
2016-02-27:
07:06:45 2 Blah blah blah
08:06:45 3 Blah blah blah
no date line
blank lines
another no date line
2016-02-28:
07:06:45 4 Blah blah blah
08:06:45 5 Blah blah blah
==> tmpd/0000-00-00 <==
first line without date
==> tmpd/2016-02-26 <==
08:06:45 0 Blah blah blah
09:06:45 1 Blah blah blah
==> tmpd/2016-02-27 <==
07:06:45 2 Blah blah blah
08:06:45 3 Blah blah blah
no date line
blank lines
another no date line
==> tmpd/2016-02-28 <==
07:06:45 4 Blah blah blah
08:06:45 5 Blah blah blah
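For the timing comparison requested above, a larger synthetic log in the same format is handy. The following is only a sketch: make_sample_log is a hypothetical helper, and split.sh in the commented usage is a placeholder name for wherever the script above is saved.

```shell
#!/bin/bash
# Hypothetical helper: print N lines per day for 2016-02-26 .. 2016-02-28
# in the same "date time counter text" layout as the sample log.
make_sample_log() {
    awk -v n="${1:-100000}" 'BEGIN {
        for (d = 26; d <= 28; d++)
            for (i = 0; i < n; i++)
                printf "2016-02-%02d %02d:%02d:%02d %d Blah blah blah\n",
                       d, int(i / 3600) % 24, int(i / 60) % 60, i % 60, i
    }'
}

# Example timing run (split.sh is a placeholder name, not from the thread):
#   make_sample_log 1000000 > /tmp/server.log
#   time ./split.sh /tmp/out < /tmp/server.log > /dev/null
```

Sending stdout to /dev/null in the timing run measures the splitting itself rather than the speed of the terminal.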
The read builtin in bash is absurdly slow. You can make it faster, but you will probably get a bigger speedup with awk:
#!/bin/bash
log_file=input
log_directory=${1-logs}
mkdir -p "$log_directory"
awk 'NF>1{d=l"/"$1; $1=""; print > d}' l="$log_directory" "$log_file"
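One side effect of the one-liner above is worth knowing: assigning to $1 makes awk rebuild the record with the output field separator, so every output line starts with a space and runs of whitespace inside the line collapse to single spaces. It also leaves every date's file open until awk exits, and some awk implementations limit how many files can be open at once. A sketch that avoids both, assuming each dated line starts with the date followed by exactly one space (split_log_awk is a hypothetical name):

```shell
#!/bin/bash
# Hypothetical helper: split_log_awk LOGFILE OUTDIR
split_log_awk() {
    local log_file=$1 log_directory=$2
    mkdir -p "$log_directory"
    awk -v l="$log_directory" '
        NF > 1 {
            d = l "/" $1
            # Close the previous file when the date changes, so only one
            # output file is open at any time.
            if (d != prev) { if (prev != "") close(prev); prev = d }
            # substr copies the rest of the line unchanged instead of
            # rebuilding it from fields. Append mode is used because a file
            # reopened after close() would otherwise be truncated.
            print substr($0, length($1) + 2) >> d
        }
    ' "$log_file"
}
```

Because the files are opened in append mode, running this twice into the same directory doubles the output, so start from an empty directory.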
If you really want to print to stdout as well, you can, but if that's going to a tty it is going to slow things down a lot. Just use:
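awk 'NF>1{d=l"/"$1; $1=""; print > d}1' l="$log_directory" "$log_file"
(Note the "1" after the closing brace. It is a pattern that is always true with no action, so awk applies the default action and prints every line to stdout. The NF>1 guard keeps blank lines from being written to the bare directory path.)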
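As for making the bash loop itself faster, most of the cost in the original comes from starting a tee process for every line and piping the file through cat. A minimal sketch that stays in pure bash, assuming undated lines can simply be skipped and that no echo to stdout is needed (split_log_bash is a hypothetical name):

```shell
#!/bin/bash
# Hypothetical helper: split_log_bash LOGFILE OUTDIR
split_log_bash() {
    local log_file=$1 log_folder=$2 line date
    mkdir -p "$log_folder"
    # IFS= and -r keep leading whitespace and backslashes intact.
    while IFS= read -r line; do
        # Assumption: lines that do not start with "YYYY-MM-DD " are skipped.
        [[ $line == [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\ * ]] || continue
        date=${line:0:10}
        # printf is a builtin, so no external process is started per line.
        printf '%s\n' "${line:11}" >> "$log_folder/$date"
    done < "$log_file"
}
```

This still reopens an output file for every line, so expect awk to remain faster on a multi-gigabyte log.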