My impression of Python's multiprocessing module is that when you create a new process with multiprocessing.Process(), it makes a complete copy of your current program in memory and continues working from there. With that in mind, I'm confused by the behaviour of the following script.
WARNING: This script will allocate a large amount of memory! Run it with caution!
import multiprocessing
import numpy as np
from time import sleep

# Declare a dictionary globally
bigDict = {}

def sharedMemory():
    # Using numpy, store 1GB of random data
    for i in xrange(1000):
        bigDict[i] = np.random.random((125000))
    bigDict[0] = "Known information"

    # In System Monitor, 1GB of memory is being used
    sleep(5)

    # Start 4 processes - each should get a copy of the 1GB dict
    for _ in xrange(4):
        p = multiprocessing.Process(target=workerProcess)
        p.start()

    print "Done"

def workerProcess():
    # Sleep - only 1GB of memory is being used, not the expected 4GB
    sleep(5)

    # Each process has access to the dictionary, even though the memory is shared
    print multiprocessing.current_process().pid, bigDict[0]

if __name__ == "__main__":
    sharedMemory()
The above program illustrates my confusion - it seems like the dict automatically becomes shared between the processes. I thought I had to use a multiprocessing manager to get that behaviour. Could someone explain what is going on?
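To be clear, the kind of manager-based sharing I mean looks roughly like the sketch below (illustrative only, not code from my actual program), where workers write into a manager.dict() and the parent sees those writes:

import multiprocessing

def workerProcess(shared):
    # Writes land in the manager's server process, so they are
    # visible to the parent and to every other worker.
    shared[multiprocessing.current_process().pid] = "hello"

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    shared = manager.dict()
    procs = [multiprocessing.Process(target=workerProcess, args=(shared,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The parent sees every worker's entry.
    print(dict(shared))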
On Linux, forking a process doesn't immediately double the amount of memory in use. Instead, the page table of the new process is set up to point to the same physical memory as the old process, and a page is only copied when one of the processes writes to it (copy-on-write, COW). The result is that both processes appear to have separate memory, but physical memory is only allocated once one of the processes actually writes to a page.
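To make this concrete, here is a minimal Python 3 sketch (the array size, function names and printed values are purely illustrative) that assumes Linux, where the "fork" start method is available: a child that only reads the inherited array keeps sharing the parent's pages, while a child that writes to it forces the touched pages to be copied, leaving the parent's data untouched.

import multiprocessing
import numpy as np

# ~80 MB of data created in the parent before forking.
data = np.random.random(10000000)

def reader():
    # Reading inherited pages does not copy them, so no extra
    # physical memory is allocated for this child.
    print("reader", multiprocessing.current_process().pid, data[0])

def writer():
    # Writing forces the kernel to copy every page that is touched,
    # so this child gets its own private copy of the array.
    data[:] = 0.0
    print("writer", multiprocessing.current_process().pid, data[0])

if __name__ == "__main__":
    # "fork" is the behaviour described above; it is the default on Linux.
    multiprocessing.set_start_method("fork")
    for target in (reader, writer):
        p = multiprocessing.Process(target=target)
        p.start()
        p.join()
    # The writer's changes are invisible here: its pages were private copies.
    print("parent", data[0])

If you watch memory use while this runs, overall allocation only grows (by roughly the size of the array) while the writing child runs; the reading child adds essentially nothing, which is exactly why your script shows 1GB rather than 4GB.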