I have a Python script that traverses a list (>1000 elements), finds each element in a large file, and then outputs the result. This means I am reading the entire file more than 1000 times. I tried using multiprocessing, but it did not help much. Here's what I am trying to do:
import gzip
from multiprocessing.pool import ThreadPool as Pool

def getForwardIP(clientIP, requestID):
    # Scans the whole log for a single request ID; called once per ID.
    with gzip.open("xyz.log") as infile:
        for lines in infile:
            line = lines.split(" ")
            myRequestID = line[0]
            forwardIP = line[1]
            if myRequestID == requestID:
                print forwardIP

if __name__ == "__main__":
    pool_size = 8
    pool = Pool(pool_size)
    request_id_list = list()
    # request_id_list contains >1000 elements
    for id in request_id_list:
        pool.apply_async(getForwardIP, ("1.2.3.4.", id,))
    pool.close()
    pool.join()
Is there a faster way? Any help will be appreciated. Thanks!
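For reference, one direction I have been considering is to read the log only once, index it by request ID, and then answer every lookup from memory. This is just a sketch (buildForwardIPIndex is a name I made up for it), and it assumes the whole index fits in RAM and that each request ID maps to a single forward IP:

import gzip

def buildForwardIPIndex(filename):
    # Hypothetical helper: one pass over the log, map request_id -> forward_ip.
    index = {}
    with gzip.open(filename) as infile:
        for lines in infile:
            line = lines.split(" ")
            index[line[0]] = line[1]
    return index

forward_index = buildForwardIPIndex("xyz.log")
for id in request_id_list:
    if id in forward_index:
        print forward_index[id]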
EDIT
(I am posting my entire code here.) Thanks everyone for the suggestions. Now I am reading each file into a list once, rather than reading it 1000 times. I tried to multiprocess the for loop, but it didn't work. Below is the code:
import gzip
import datetime
from multiprocessing.pool import ThreadPool as Pool

def getRequestID(r_line_filename, clientIP):
    # r_line_filename is a file with request_id and client_ip
    requestIDList = list()
    with gzip.open(r_line_filename) as infile:
        for lines in infile:
            line = lines.split(" ")
            requestID = line[1].strip("\n")
            myclientIP = line[0]
            if myclientIP == clientIP:
                requestIDList.append(requestID)
    print "R line List Ready!"
    return requestIDList

def getFLineList(fFilename):
    # fFilename is a file with format request_id, forward_ip, epoch time
    fLineList = list()
    with gzip.open(fFilename) as infile:
        for lines in infile:
            fLineList.append(lines.split())
    print "F line list ready!"
    return fLineList

def forwardIP(lines, requestID, clientIP):
    # lines is one pre-split F-line record: [request_id, forward_ip, epoch]
    myrequestID = lines[0]
    forwardIP = lines[1]
    epoch = int(lines[2].split(".")[0])
    timex = datetime.datetime.fromtimestamp(epoch).strftime('%Y-%m-%d %H:%M:%S')
    if myrequestID == requestID:
        print "%s %s %s" % (clientIP, timex, forwardIP)

if __name__ == "__main__":
    pool = Pool()
    clientIP = "x.y.z.a"
    rLineList = getRequestID("rLine_subset.log.gz", clientIP)
    fLineList = getFLineList("fLine_subset.log.gz")
    for RID in rLineList:
        for lines in fLineList:
            pool.apply_async(forwardIP, (lines, RID, clientIP,))
    pool.close()
    pool.join()
The multiprocessing part is not working. In fact, it is much slower: if I skip the pool and simply traverse the lists with plain nested loops, it is faster. Thanks in advance for your help!
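My current guess is that the per-task overhead is the problem: each forwardIP call is essentially one string comparison, so submitting len(rLineList) * len(fLineList) tiny tasks to the pool costs more in scheduling than the work itself. A plain dict index over fLineList would remove the inner loop entirely. Here is a sketch of what I mean (buildIndex is a hypothetical helper; I assume one request ID can appear on several F lines, so the index stores lists):

from collections import defaultdict

def buildIndex(fLineList):
    # Hypothetical helper: one pass, map request_id -> list of its F-line records.
    index = defaultdict(list)
    for lines in fLineList:
        index[lines[0]].append(lines)
    return index

index = buildIndex(fLineList)
for RID in rLineList:
    # O(1) lookup per request ID instead of scanning all of fLineList
    for lines in index.get(RID, []):
        forwardIP(lines, RID, clientIP)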