How to manage scope using multiprocessing

Posted 2019-08-29 11:58

Question:

I'm trying to implement a function that uses Python multiprocessing to speed up a calculation. I'm trying to create a pairwise distance matrix, but the implementation with for loops takes more than 8 hours.

This code seems to run faster, but when I print the matrix it is full of zeros. When I print the rows inside the function, the values look correct. I think it is a scope problem, but I cannot understand how to deal with it.

import multiprocessing
import time
import numpy as np

def MultiProcessedFunc(i,x):
    for j in range(i,len(x)):
        time.sleep(0.08)
        M[i,j] = (x[i]+x[j])/2
    print(M[i,:]) # Check if the operation works
    print('')

processes = []

v = [x+1 for x in range(8000)]
M = np.zeros((len(v),len(v)))

start = time.time() # Start the timer used by the final print

for i in range(len(v)):
    p = multiprocessing.Process(target = MultiProcessedFunc, args =(i,v))
    processes.append(p)
    p.start()

for process in processes:
    process.join()
end = time.time()

print('Multiprocessing: {}'.format(end-start))
print(M)

Answer 1:

Unfortunately, your code won't work written that way. Multiprocessing spawns separate processes, which means that their memory spaces are separate! Changes made by one subprocess will not be reflected in the other subprocesses or in your parent process.

Strictly speaking this is not a scoping issue. Scope is something defined inside a single interpreter process.
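To make the separate memory spaces concrete, here is a minimal sketch. The function name overwrite_first and the list data are made up for illustration and are not part of the question's code. The child changes its own copy of the list, and the parent never sees the change:

```python
import multiprocessing


def overwrite_first(values):
    # Runs in the child process: this modifies the child's own copy of the list.
    values[0] = 99
    print("child sees: ", values)


if __name__ == "__main__":
    data = [0, 0, 0]
    p = multiprocessing.Process(target=overwrite_first, args=(data,))
    p.start()
    p.join()
    # The parent's list is untouched because the child worked on a copy.
    print("parent sees:", data)
```

The same thing happens to M in the question: every child fills in a row of its own copy, prints it, and exits, while the parent's M stays all zeros.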

The module does provide means of sharing memory between processes, but this comes at a cost: shared memory is considerably slower because access to it usually has to be synchronized with locks.
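As a rough sketch of the shared-memory route, multiprocessing.Array can hold the matrix as one flat buffer of doubles that every child writes into. The helper fill_row, the flat i * n + j indexing, and the tiny input are my own assumptions for illustration. I pass lock=False here, which skips the locking mentioned above and is only reasonable because each process writes a different row:

```python
import multiprocessing


def fill_row(shared, n, i, x):
    # Write row i of the upper triangle into the flat shared buffer.
    for j in range(i, n):
        shared[i * n + j] = (x[i] + x[j]) / 2


if __name__ == "__main__":
    v = [k + 1 for k in range(4)]  # tiny input, for illustration only
    n = len(v)
    # n * n doubles in shared memory, initialised to zero.
    # lock=False returns the raw array without a synchronisation wrapper.
    shared = multiprocessing.Array("d", n * n, lock=False)
    procs = [
        multiprocessing.Process(target=fill_row, args=(shared, n, i, v))
        for i in range(n)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The parent reads the values the children wrote.
    for i in range(n):
        print(shared[i * n:(i + 1) * n])
```

Starting one process per row is itself expensive, so with 8000 rows you would probably want far fewer processes, each handling a block of rows.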

Now, NumPy has a nice feature: it releases the GIL during many of its computations. Threads also share the memory of the process that created them, so writes to M are visible to the main thread. This means that using multithreading instead of multiprocessing should give you some benefit with few other changes to your code: simply replace import multiprocessing with import threading, and multiprocessing.Process with threading.Thread. The code should then produce the correct result. On my machine, with the print statements and the sleep call removed, it runs in under 8 seconds:

Multiprocessing: 7.48570203781
[[1.000e+00 1.000e+00 2.000e+00 ... 3.999e+03 4.000e+03 4.000e+03]
 [0.000e+00 2.000e+00 2.000e+00 ... 4.000e+03 4.000e+03 4.001e+03]
 [0.000e+00 0.000e+00 3.000e+00 ... 4.000e+03 4.001e+03 4.001e+03]
 ...
 [0.000e+00 0.000e+00 0.000e+00 ... 7.998e+03 7.998e+03 7.999e+03]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 7.999e+03 7.999e+03]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 8.000e+03]]
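A small sketch of the threaded version, assuming a much smaller input than the 8000 elements in the question. The names fill_row and pairwise_threads are made up for this example. I also replaced the inner for loop with a NumPy slice expression, which is my own variation and not what the timing above measured:

```python
import threading

import numpy as np


def fill_row(M, i, x):
    # Threads share the parent's memory, so this write lands in the one real M.
    M[i, i:] = (x[i] + x[i:]) / 2


def pairwise_threads(values):
    x = np.asarray(values, dtype=float)
    M = np.zeros((len(x), len(x)))
    threads = [
        threading.Thread(target=fill_row, args=(M, i, x))
        for i in range(len(x))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return M


if __name__ == "__main__":
    print(pairwise_threads([k + 1 for k in range(5)]))
```

Each thread writes to a different row, so no lock is needed around the assignment.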

An alternative is to have your subprocesses return their results and then combine them in your main process.
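One way this could look, as a sketch only: a multiprocessing.Pool maps a worker over the row indices, each worker returns its row, and the parent copies the rows into M. The names half_sum_row and pairwise_pool and the workers parameter are assumptions of mine, not something from the question:

```python
import multiprocessing

import numpy as np


def half_sum_row(args):
    # Worker: compute row i of the upper triangle and return it with its index.
    i, x = args
    return i, [(x[i] + x[j]) / 2 for j in range(i, len(x))]


def pairwise_pool(x, workers=2):
    M = np.zeros((len(x), len(x)))
    tasks = [(i, x) for i in range(len(x))]
    with multiprocessing.Pool(workers) as pool:
        # Results come back to the parent, which is the only writer of M.
        for i, row in pool.map(half_sum_row, tasks):
            M[i, i:] = row
    return M


if __name__ == "__main__":
    print(pairwise_pool([k + 1 for k in range(5)]))
```

Note that this sends x to the workers with every task, so for a large input it is probably worth handing each worker a block of rows rather than a single row.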