Checking fork behaviour in python multiprocessing

Posted 2019-05-25 06:45

Question:

I have to access a set of large and not picklable Python objects from many processes. Therefore, I would like to ensure that these objects are not copied completely.

According to comments in this and this post, objects are not copied (on Unix systems) unless they are changed. However, merely referencing an object changes its reference count, so the memory holding that count will then be copied.

Is this correct so far? Since my concern is the size of my large objects, I do not have a problem if small parts of these objects are copied.

To ensure that I understood everything correctly and that nothing unexpected happens, I implemented a small test program:

from multiprocessing import Pool

def f(arg):
    print(l, id(l), object.__repr__(l))
    l[arg] = -1
    print(l, id(l), object.__repr__(l))

def test(n):
    global l
    l = list(range(n))
    with Pool() as pool: 
        pool.map(f, range(n))
    print(l, id(l), object.__repr__(l))

if __name__ == '__main__':
    test(5) 

In the first line of f, I would expect id(l) to return the same number in all function calls, since the list is not changed before the id check.

On the other hand, in the third line of f, id(l) should return a different number in each method call, since the list is changed in the second line.

However, the program output puzzles me.

[0, 1, 2, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[-1, 1, 2, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, 2, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, -1, 2, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, 2, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, -1, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, 2, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, 2, -1, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, 2, 3, 4] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, 2, 3, -1] 139778408436488 <list object at 0x7f20b261d308>
[0, 1, 2, 3, 4] 139778408436488

The id is the same in all calls and lines of f. This is the case even though the list remains unchanged at the end (as expected), which implies that the list has been copied.

How can I see whether an object has been copied or not?

Answer 1:

Your confusion seems to be caused by a misunderstanding of how processes and fork work. Each process has its own address space, so two processes can use the same addresses without conflict. This also means a process can't access the memory of another process unless the same memory is mapped into both processes.

When a process invokes the fork system call, the operating system creates a new child process that's a clone of the parent process. This clone, like any other process, has its own address space distinct from its parent's. However, the contents of that address space are an exact copy of the parent's. This used to be accomplished by copying the memory of the parent process into new memory allocated for the child. This means that once the child and parent resume executing after the fork, any modifications either process makes to its own memory don't affect the other.
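The fork semantics described above can be observed directly with os.fork. The following is only a minimal sketch for Unix-like systems; the function name fork_demo and the use of a pipe to report the child's view are illustrative choices, not part of the original question.

```python
import os

def fork_demo():
    """Fork once and return what parent and child each see for the same list.

    Hypothetical helper for illustration; works only where os.fork exists.
    """
    data = [0, 1, 2]
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: change its own copy of the list and report through the pipe.
        try:
            os.close(read_fd)
            data[0] = -1
            os.write(write_fd, f"{id(data)} {data}".encode())
            os.close(write_fd)
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent: read the child's report, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_report = pipe.read()
    os.waitpid(pid, 0)
    child_id, child_repr = child_report.split(" ", 1)
    return id(data), data, int(child_id), child_repr

if __name__ == "__main__":
    parent_id, parent_data, child_id, child_repr = fork_demo()
    print("parent:", parent_id, parent_data)
    print("child: ", child_id, child_repr)
```

Both processes report the same id, because the address is only meaningful inside each process's own address space, while the parent's list stays unchanged after the child's write.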

However, copying the entire address space of a process is an expensive operation, and is usually a waste. Most of the time the new process immediately executes a new program, which results in the child's address space being replaced completely. So modern Unix-like operating systems use a "copy-on-write" fork implementation instead. Rather than copying the memory of the parent process, the parent's memory is mapped into the child so that they can share the same memory. However, the old semantics are still maintained. If either the child or the parent modifies the shared memory, the modified page is copied so that the two processes no longer share that page of memory.

When the multiprocessing module calls your f function, it does so in a child process that was created by using the fork system call. Since this child process is a clone of the parent, it also has a global variable named l which refers to a list that has the same ID (address) and the same contents in both processes. That is, until you modify the list referred to by l in the child process. The ID doesn't (and can't) change, but the child's version of the list is no longer the same as the parent's. The contents of the parent's list are unaffected by the modification made by the child.
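A variant of the test program makes this easier to see by sending each child's view back to the parent. This is a sketch under a few assumptions: it forces the "fork" start method (so it will not run on Windows), it uses one Process per index instead of a Pool so that every child starts from the parent's unmodified list, and the names shared, child and run_children are made up for illustration.

```python
import multiprocessing as mp

shared = list(range(5))  # created in the parent before any child is forked

def child(index, queue):
    """Runs in a forked child: modify the inherited list and report back."""
    shared[index] = -1
    queue.put((id(shared), list(shared)))

def run_children():
    """Fork one child per index and collect (id, contents) from each."""
    ctx = mp.get_context("fork")  # assumption: a Unix-like platform
    queue = ctx.Queue()
    reports = []
    for index in range(len(shared)):
        proc = ctx.Process(target=child, args=(index, queue))
        proc.start()
        reports.append(queue.get(timeout=10))  # fetch the report before joining
        proc.join()
    return reports

if __name__ == "__main__":
    for child_id, contents in run_children():
        print(child_id, contents, child_id == id(shared))
    print(id(shared), shared)  # the parent's list is still [0, 1, 2, 3, 4]
```

Every child reports the same id as the parent but different contents, so id() says nothing about whether the underlying memory is still shared.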

Note that the behaviour described in the previous paragraph is the same whether fork uses copy-on-write or not. As far as the multiprocessing module and Python in general are concerned, that's just an implementation detail. The effective result is the same regardless. This means you can't really test from within a Python program which fork implementation is used.