reusable function: substituting the values returne

Below is the snippet: I'm parsing job log and the output is the formatted result.

def job_history(f):

    def get_value(j,n):
        return j[n].split('=')[1]

    lines = read_file(f)
    for line in lines:
        if line.find('Exit_status=') != -1:
            nLine = line.split(';')
            jobID = '.'.join(nLine[2].split('.',2)[:-1]
            jData = nLine[3].split(' ')
            jUsr = get_value(jData,0)
            jHst = get_value(jData,9)
            jQue = get_value(jData,3)
            eDate = job_value(jData,14)

            global LJ,LU,LH,LQ,LE
            LJ = max(LJ, len(jobID))
            LU = max(LU, len(jUsr))
            LH = max(LH, len(jHst))
            LQ = max(LQ, len(jQue))
            LE = max(LE, len(eDate))

            print "%-14s%-12s%-14s%-12s%-10s" % (jobID,jUsr,eDate,jHst,jQue)

   return LJ,LU,LE,LH,LQ

In principle, I should have another function like this:

def fmt_print(a,b,c,d,e):
    print "%-14s%-12s%-14s%-12s%-10s\n" % (a,b,c,d,e)

to print the header and call the functions like this to print the complete result:

fmt_print('JOB ID','OWNER','E_DATE','R_HOST','QUEUE')
job_history(inFile)

My question is: how can I make fmt_print() to print both the header and the result using the values LJ,LU,LE,LH,LQ for the format spacing. the job_history() will parse a number of log files from the log-directory. The length of the field of similar type will differ from file to file and I don't wanna go static with the spacing (assuming the max length per field) for this as there gonna be lot more columns to print (than the example). Thanks in advance for your help. Cheers!!

PS. For those who know my posts: I don't have to use python v2.3 anymore. I can use even v2.6 but I want my code to be v2.4 compatible to go with RHEL5 default.

Update: 1

I had a fundamental problem in my original script. As I mentioned above that the job_history() will read the multiple files in a directory in a loop, the max_len were being calculated per file and not for the entire result. After modifying unutbu's script a little bit and following xtofl's (if this is what it meant) suggestion, I came up with this, which seems to be working.

def job_history(f):
    result=[]
    for line in lines:
        if line.find('Exit_status=') != -1:
            ....
            ....
            global LJ,LU,LH,LQ,LE
            LJ = max(LJ, len(jobID))
            LU = max(LU, len(jUsr))
            LH = max(LH, len(jHst))
            LQ = max(LQ, len(jQue))
            LE = max(LE, len(eDate))

            result.append((jobID,jUsr,eDate,jHst,jQue))

    return LJ,LU,LH,LQ,LE,result

# list of log files 
inFiles = [ m for m in os.listdir(logDir) ]

saved_ary = []
for inFile in sorted(inFiles):
    LJ,LU,LE,LH,LQ,result = job_history(inFile)
    saved_ary += result

# format printing
fmt_print = "%%-%ds  %%-%ds  %%-%ds  %%-%ds  %%-%ds" % (LJ,LU,LE,LH,LQ)
print_head = fmt_print % ('Job Id','User','End Date','Exec Host','Queue')
print '%s\n%s' % (print_head, len(print_head)*'-')

for lines in saved_ary:
    print fmt_print % lines

I'm sure there are lot other better ways of doing this, so suggestion(s) are welcomed. cheers!!

Update: 2

Sorry for brining up this "solved" post again. Later discovered, I was even wrong with my updated script, so I thought I'd post another update for future reference. Even though it appeared to be working, actually length_data were overwritten with the new one for every file in the loop. This works correctly now.

def job_history(f):

    def get_value(j,n):
        return j[n].split('=')[1]

    lines = read_file(f)
    for line in lines:
        if "Exit_status=" in line:
            nLine = line.split(';')
            jobID = '.'.join(nLine[2].split('.',2)[:-1]
            jData = nLine[3].split(' ')
            jUsr = get_value(jData,0)
            ....

        result.append((jobID,jUsr,...,....,...))
    return result

# list of log files 
inFiles = [ m for m in os.listdir(logDir) ]

saved_ary = []
LJ = 0; LU = 0; LE = 0; LH = 0; LQ = 0

for inFile in sorted(inFiles):
    j_data = job_history(inFile)
    saved_ary += j_data

for ix in range(len(saved_ary)):
    LJ = max(LJ, len(saved_ary[ix][0]))
    LU = max(LU, len(saved_ary[ix][1]))
    ....

# format printing
fmt_print = "%%-%ds  %%-%ds  %%-%ds  %%-%ds  %%-%ds" % (LJ,LU,LE,LH,LQ)
print_head = fmt_print % ('Job Id','User','End Date','Exec Host','Queue')
print '%s\n%s' % (print_head, len(print_head)*'-')

for lines in saved_ary:
    print fmt_print % lines

The only problem is it's taking a bit of time to start printing the info on the screen, just because, I think, as it's putting all the in the array first and then printing. Is there any why can it be improved? Cheers!!

回答1:

Since you don't know LJ, LU, LH, LQ, LE until the for-loop ends, you have to complete this for-loop before you print.

    result=[]
    for line in lines:
        if line.find('Exit_status=') != -1:
            ...
            LJ = max(LJ, len(jobID))
            LU = max(LU, len(jUsr))
            LH = max(LH, len(jHst))
            LQ = max(LQ, len(jQue))
            LE = max(LE, len(eDate))
            result.append((jobID,jUsr,eDate,jHst,jQue))
   fmt="%%-%ss%%-%ss%%-%ss%%-%ss%%-%ss"%(LJ,LU,LE,LH,LQ)
   for jobID,jUsr,eDate,jHst,jQue in result:       
       print fmt % (jobID,jUsr,eDate,jHst,jQue)

The fmt line is a bit tricky. When you use string interpolation, each %s gets replaced by a number, and %% gets replaced by a single %. This prepares the correct format for the subsequent print statements.

回答2:

Since the column header and column content are so closely related, why not couple them into one structure, and return an array of 'columns' from your job_history function? The task of that function would be to

output the header for each colum
create the output for each line, into the corresponding column
remember the maximum width for each column, and store it in the column struct

Then, the prinf_fmt function can 'just'

iterate over the column headers, and print them using the respective width
iterate over the 'rest' of the output, printing each cell with 'the respective width'

This design will separate output definition from actual formatting.

This is the general idea. My python is not that good; but I may think up some example code later...

回答3:

Depending on how many lines are there, you could:

read everything first to figure out the maximum field lengths, then go through the lines again to actually print out the results (if you have only a handful of lines)
read one page of results at a time and figure out maximum length for the next 30 or so results (if you can handle the delay and have many lines)
don't care about the format and output in a csv or some database format instead - let the final person / actual report generator worry about importing it