I am completely new to Python and I have a serious problem which I cannot solve.
I have a few log files with identical structure:
[timestamp] [level] [source] message
For example:
[Wed Oct 11 14:32:52 2000] [error] [client] error message
I need to write a program in pure Python that merges these log files into one file and then sorts the merged file by timestamp. After this operation I want to print the result (the contents of the merged file) to STDOUT.
I don't understand how to do this and would like some help. Is this possible?
You can do this:
import fileinput
import re
from time import strptime

f_names = ['1.log', '2.log']  # names of log files
lines = list(fileinput.input(f_names))
t_fmt = '%a %b %d %H:%M:%S %Y'  # format of time stamps
t_pat = re.compile(r'\[(.+?)\]')  # pattern to extract timestamp
for l in sorted(lines, key=lambda l: strptime(t_pat.search(l).group(1), t_fmt)):
    print(l, end='')  # each line already ends with a newline
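To make the key function above less of a black box, here is a small sketch of what the pattern and the format string do to a single line. The sample line is the one from the question; the variable names stamp and parsed are only for illustration, and it assumes English day and month names (the default C locale).

```python
import re
from time import strptime

t_fmt = '%a %b %d %H:%M:%S %Y'    # format of time stamps
t_pat = re.compile(r'\[(.+?)\]')  # non-greedy: stops at the first closing bracket

line = '[Wed Oct 11 14:32:52 2000] [error] [client] error message\n'

stamp = t_pat.search(line).group(1)  # text inside the first [...] pair
parsed = strptime(stamp, t_fmt)      # time.struct_time, compared field by field

print(stamp)
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
```

Because struct_time values compare year first, then month, day and so on, sorting on them gives chronological order.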
First off, you will want to use the fileinput module for getting data from multiple files, like:
import fileinput

data = fileinput.FileInput()  # reads the files named on the command line
for line in data:
    print(line, end='')
This prints the lines of all the files together. You also want to sort, which you can do with the built-in sorted function.
If your lines started with a timestamp like [2011-07-20 19:20:12], you would be golden, as that format sorts chronologically under a plain alphanumeric sort, so you could do:
data = fileinput.FileInput()
for line in sorted(data):
    print(line, end='')
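To see why the timestamp format matters, here is a small sketch with made-up sample lines (they are not from any real log, and the names iso_lines and named_lines are only for illustration). It compares plain string sorting on the two formats.

```python
# Year-month-day timestamps: plain string sorting is also chronological.
iso_lines = [
    '[2011-07-20 19:20:12] [error] [client] b\n',
    '[2011-07-19 08:00:00] [error] [client] a\n',
]
iso_sorted = sorted(iso_lines)

# Timestamps that start with a day name: string sorting compares
# "Fri" < "Thu" < "Wed" first, which has nothing to do with the date.
named_lines = [
    '[Wed Oct 11 14:32:52 2000] [error] [client] later\n',
    '[Thu Oct 12 09:00:00 2000] [error] [client] latest\n',
    '[Fri Oct 06 10:00:00 2000] [error] [client] earliest\n',
]
text_sorted = sorted(named_lines)

for line in text_sorted:
    print(line, end='')
```

In the second list the Thursday line ends up before the Wednesday line even though it is a day later, which is why the timestamp has to be parsed before sorting.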
However, as your timestamp format is more complex, you need to do:
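from datetime import datetime
from functools import cmp_to_key

def compareDates(line1, line2):
    # parse the dates into datetime objects
    fmt = '[%a %b %d %H:%M:%S %Y'
    date1 = datetime.strptime(line1.split(']')[0], fmt)
    date2 = datetime.strptime(line2.split(']')[0], fmt)
    # then use those for the sorting (negative, zero or positive result)
    return (date1 > date2) - (date1 < date2)

data = fileinput.FileInput()
for line in sorted(data, key=cmp_to_key(compareDates)):
    print(line, end='')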
For bonus points, you can even do:
data = fileinput.FileInput(openhook=fileinput.hook_compressed)
which will enable you to read in gzipped log files.
The usage would then be:
$ python yourscript.py access.log.1 access.log.*.gz
or similar.
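If you would rather not depend on the openhook, the same idea can be sketched by hand with the gzip module. This is only a rough sketch: open_log and read_lines are hypothetical helper names, and it assumes that any file whose name ends in .gz is gzip-compressed text.

```python
import gzip

def open_log(path):
    """Open a log file as text, decompressing it when the name ends in .gz."""
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')  # 'rt' yields str lines rather than bytes
    return open(path, 'r')

def read_lines(paths):
    """Collect the lines of every file in paths into one list."""
    lines = []
    for path in paths:
        with open_log(path) as f:
            lines.extend(f)
    return lines
```

You would call it as read_lines(sys.argv[1:]) and then sort the returned list by timestamp.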
As for the critical sorting function:
def sort_key(line):
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')
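This should be used as the key argument to sort or sorted, not as cmp. It is faster this way, because the key is computed once per line, while a cmp function parses the timestamps again on every comparison.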
Oh, and you should have
from datetime import datetime
in your code to make this work.
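Putting the pieces together, a minimal end-to-end sketch could look like the following. The sample lines are made up for illustration, merge_logs and TIME_FORMAT are hypothetical names, and it assumes English day and month names (the default C locale) in the timestamps.

```python
from datetime import datetime

# Timestamp layout, including the opening bracket left over after the split.
TIME_FORMAT = '[%a %b %d %H:%M:%S %Y'

def sort_key(line):
    """Return the leading [timestamp] of a log line as a datetime."""
    return datetime.strptime(line.split(']')[0], TIME_FORMAT)

def merge_logs(lines):
    """Return all lines ordered by timestamp, oldest first."""
    return sorted(lines, key=sort_key)

sample = [
    '[Wed Oct 11 14:32:52 2000] [error] [client] second\n',
    '[Tue Oct 10 09:00:00 2000] [error] [client] first\n',
    '[Mon Jan 01 00:00:00 2001] [notice] [client] third\n',
]

for line in merge_logs(sample):
    print(line, end='')  # lines keep their own trailing newline
```

In a real script, the sample list would be replaced by the lines read from the log files.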
Read the lines of both files into a list (they are now merged), provide a user-defined compare function that converts each timestamp to seconds since the epoch, call sort with that compare function, and then write the lines to the merged file:
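import time
from functools import cmp_to_key

def compare_func(line1, line2):
    # comparison code: convert each timestamp to seconds since the epoch
    fmt = '[%a %b %d %H:%M:%S %Y'
    e1, e2 = [time.mktime(time.strptime(l.split(']')[0], fmt)) for l in (line1, line2)]
    return (e1 > e2) - (e1 < e2)

lst = []
for name in ("file_1.log", "file_2.log"):
    with open(name, "r") as f:
        lst.extend(f)  # read every line of the file into the list
lst.sort(key=cmp_to_key(compare_func))  # this could be a lambda if it is simple enough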
Something like that should do it.