Merging and sorting log files in Python

Posted 2019-02-09 00:00

Question:

I am completely new to Python and I have a serious problem which I cannot solve.

I have a few log files with identical structure:

[timestamp] [level] [source] message

For example:

[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] error message

I need to write a program in pure Python which should merge these log files into one file and then sort the merged file by timestamp. After this operation I wish to print this result (the contents of the merged file) to STDOUT (console).

I don't understand how to do this and would like some help. Is this possible?

Answer 1:

You can do this:

import fileinput
import re
from time import strptime

f_names = ['1.log', '2.log']      # names of the log files
lines = list(fileinput.input(f_names))
t_fmt = '%a %b %d %H:%M:%S %Y'    # format of the timestamps
t_pat = re.compile(r'\[(.+?)\]')  # pattern to extract the timestamp
for l in sorted(lines, key=lambda l: strptime(t_pat.search(l).group(1), t_fmt)):
    print(l, end='')  # lines already end with a newline
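To sanity-check this approach without real log files, you can sort a couple of in-memory sample lines (the lines below are invented for illustration):

```python
import re
from time import strptime

t_fmt = '%a %b %d %H:%M:%S %Y'    # same timestamp format as above
t_pat = re.compile(r'\[(.+?)\]')  # same extraction pattern as above

sample = [
    '[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] second\n',
    '[Tue Oct 10 09:00:00 2000] [notice] [client 127.0.0.1] first\n',
]
# sort by the parsed timestamp rather than by raw string
merged = sorted(sample, key=lambda l: strptime(t_pat.search(l).group(1), t_fmt))
for l in merged:
    print(l, end='')
```

The earlier (Tuesday) line should come out first even though it appears second in the input.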


Answer 2:

First off, you will want to use the fileinput module for getting data from multiple files, like:

data = fileinput.FileInput()
for line in data:  # FileInput objects are iterable; they have no readlines() method
    print(line, end='')

Which will then print all of the lines together. You also want to sort, which you can do with the built-in sorted function.

If your lines had started with [2011-07-20 19:20:12], you would be golden, as that format sorts chronologically with a plain alphanumeric sort, so you could do:

data = fileinput.FileInput()
for line in sorted(data):
    print(line, end='')
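To convince yourself that ISO-style `[YYYY-MM-DD HH:MM:SS]` stamps really do sort chronologically as plain strings, you can compare a string sort against a parsed-date sort (sample lines invented):

```python
from datetime import datetime

stamps = [
    '[2011-07-20 19:20:12] later entry\n',
    '[2011-07-19 08:05:33] earlier entry\n',
]
by_string = sorted(stamps)  # plain lexicographic sort
by_date = sorted(stamps,
                 key=lambda l: datetime.strptime(l.split(']')[0],
                                                 '[%Y-%m-%d %H:%M:%S'))
print(by_string == by_date)  # → True: the two orderings agree
```

This is exactly why zero-padded, big-endian timestamp formats are so convenient for logs.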

As, however, your timestamp format is more complex, you need to do something like:

import fileinput
from datetime import datetime
from functools import cmp_to_key

def parse_date(line):
    # parse the leading [...] timestamp into a datetime object
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

def compareDates(line1, line2):
    d1, d2 = parse_date(line1), parse_date(line2)
    # cmp() is gone in Python 3; emulate it
    return (d1 > d2) - (d1 < d2)

data = fileinput.FileInput()
# sorted() no longer accepts a cmp= argument; wrap the comparison with cmp_to_key
for line in sorted(data, key=cmp_to_key(compareDates)):
    print(line, end='')

For bonus points, you can even do

data = fileinput.FileInput(openhook=fileinput.hook_compressed)

which will enable you to read in gzipped log files.

The usage would then be:

$ python yourscript.py access.log.1 access.log.*.gz

or similar.
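If you want to verify the compressed-file hook, a quick round trip looks like this (the file name and log line are made up; note that hook_compressed yields bytes for gzipped files unless an encoding is given, so the sketch decodes defensively):

```python
import fileinput
import gzip

# write a tiny gzipped log to read back (hypothetical file name)
with gzip.open('old.log.gz', 'wt') as f:
    f.write('[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] compressed entry\n')

decoded = []
for raw in fileinput.input(files=['old.log.gz'],
                           openhook=fileinput.hook_compressed):
    # gzipped input may arrive as bytes; normalize to str
    decoded.append(raw.decode() if isinstance(raw, bytes) else raw)
print(decoded[0], end='')
```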



Answer 3:

As for the critical sorting function:

def sort_key(line):
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

This should be used as the key argument to sort or sorted, not as cmp. It is faster this way.

Oh, and you should have

from datetime import datetime

in your code to make this work.
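Putting that together, a minimal self-contained check might look like this (the sample lines are invented):

```python
from datetime import datetime

def sort_key(line):
    # the format string's leading '[' consumes the opening bracket
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

lines = [
    '[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] newer\n',
    '[Tue Oct 10 09:00:00 2000] [notice] [client 127.0.0.1] older\n',
]
result = sorted(lines, key=sort_key)
for line in result:
    print(line, end='')
```

The key function is called once per line, whereas a cmp function is called once per comparison, which is the main reason the key form is faster.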



Answer 4:

Read the lines of both files into a list (they are now merged), provide a key function which converts the timestamp to seconds since the epoch, call sort with that key, then write the sorted lines to the merged file...

from datetime import datetime


def to_epoch(line):
    # convert the leading [...] timestamp to seconds since the epoch
    ts = datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')
    return ts.timestamp()


lst = []

for line in open("file_1.log", "r"):
    lst.append(line)

for line in open("file_2.log", "r"):
    lst.append(line)

lst.sort(key=to_epoch)  # the key could be a lambda if it is simple enough

Something like that should do it.
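The final step the question asks for, writing the merged result to a file and echoing it to STDOUT, could then look like this sketch (`merged.log` and the sample lines are assumptions for illustration):

```python
import sys

# pretend these lines were already merged and sorted as described above
lst = [
    '[Tue Oct 10 09:00:00 2000] [notice] [client 127.0.0.1] older\n',
    '[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] newer\n',
]

with open('merged.log', 'w') as out:
    out.writelines(lst)  # persist the merged, sorted result

sys.stdout.writelines(lst)  # and print it to the console
```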