I have written a program that can be summarized as follows:
import multiprocessing

def loadHugeData():
    # load it
    return data

def processHugeData(data, res_queue):
    for item in data:
        # process it
        res_queue.put(result)
    res_queue.put("END")

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res = res_queue.get()
        while res != 'END':
            f.write(res)
            res = res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data = loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()
The real code (especially writeOutput()) is a lot more complicated. writeOutput() only uses the values that it receives as arguments (meaning it does not reference data).
Basically, it loads a huge dataset into memory and processes it. Writing the output is delegated to a sub-process (it actually writes into multiple files, and this takes a lot of time). Each time a data item gets processed, the result is sent to the sub-process through res_queue, which in turn writes it into the files as needed.
The sub-process does not need to access, read, or modify the data loaded by loadHugeData() in any way. It only needs to use what the main process sends it through res_queue. And this leads me to my problem and question.
It seems to me that the sub-process gets its own copy of the huge dataset (when checking memory usage with top). Is this true? And if so, how can I avoid it (essentially using double the memory)?
I am using Python 2.6 and the program is running on Linux.
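
To be concrete about what I mean by memory usage: what I am looking at is the resident set size (RES) of each process in top. A minimal, Linux-only sketch of an equivalent check is below; the rss_kb helper is just illustrative and not part of my real code, it simply reads VmRSS from /proc for a given pid:

import os

def rss_kb(pid=None):
    # Illustrative helper: return the resident set size (VmRSS) of a
    # process in kB by reading /proc/<pid>/status (Linux only).
    if pid is None:
        pid = os.getpid()
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                # the line looks like "VmRSS:   123456 kB"
                return int(line.split()[1])
    return 0

# e.g. compare the main process with the writer sub-process:
# print rss_kb(), rss_kb(p.pid)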