I originally wrote a nested for loop over a test 3D array in Python. Since I want to apply it to a much larger array, which would take far longer, I decided to parallelise it with ipyparallel by rewriting the loop body as a function and calling bview.map, so that I can take advantage of multiple cores/nodes on a supercomputer.
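For context, the serial version is roughly the following sketch (SSIMfunc stands in for the per-voxel computation, whose body I have omitted; imx, imy, imz are the array dimensions, as in the snippet further down):

import numpy

SSIMarray = numpy.zeros((imx, imy, imz))
# evaluate the metric at every (x, y, z) index of the volume
for cenx in range(imx):
    for ceny in range(imy):
        for cenz in range(imz):
            SSIMarray[cenx, ceny, cenz] = SSIMfunc(cenx, ceny, cenz)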
However, the code is actually slower when run on the supercomputer. When I profiled it, most of the time is spent in
method 'acquire' of 'thread.lock' objects
which, according to other Stack Exchange threads, points to synchronization overhead from shared data.
I tried using map instead of map_sync, but then time.sleep takes up roughly the same amount of time instead.
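With the asynchronous map the call looks roughly like this (the blocking presumably just moves into the result retrieval, which is where the time.sleep shows up):

amr = bview.map(SSIMfunc, cenx, ceny, cenz)   # returns immediately
result = amr.get()                            # blocks here until all tasks finish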
What is the correct way to use map here, or is there a better alternative?
Code snippet where the issue appears to be:
import itertools
import numpy

# build the full list of (x, y, z) indices and submit one task per point
cenx, ceny, cenz = zip(*itertools.product(range(imx), range(imy), range(imz)))
amr = bview.map_sync(SSIMfunc, cenx, ceny, cenz)
# map_sync returns a flat list, so reshape it back into the 3D volume
SSIMarray = numpy.asarray(amr).reshape((imx, imy, imz))
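For comparison, one coarser-grained variant I have been considering submits one task per z-slice instead of one per voxel, so the per-task scheduling cost is amortised over imx*imy calls (SSIM_slice is just a name I made up here, and imx, imy and SSIMfunc are assumed to already exist on the engines, e.g. pushed with a DirectView beforehand). I am not sure whether coarser tasks like this are the intended way to use map:

def SSIM_slice(z):
    # compute one full x-y slice on the engine so each task does real work
    import numpy
    out = numpy.empty((imx, imy))
    for x in range(imx):
        for y in range(imy):
            out[x, y] = SSIMfunc(x, y, z)
    return out

slices = bview.map_sync(SSIM_slice, range(imz))
SSIMarray = numpy.dstack(slices)   # stack the slices back along z -> (imx, imy, imz)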
and this is the profiler output:
60823909 function calls (60593868 primitive calls) in 201.869 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1089003 113.937 0.000 113.937 0.000 {method 'acquire' of 'thread.lock' objects}
64080 5.223 0.000 6.873 0.000 uuid.py:579(uuid4)
384352 4.933 0.000 5.145 0.000 {cPickle.dumps}
640560 4.526 0.000 6.064 0.000 threading.py:260(__init__)
64019 3.704 0.000 16.941 0.000 asyncresult.py:95(_init_futures)
640560 3.338 0.000 9.402 0.000 threading.py:242(Condition)
64077 3.222 0.000 31.562 0.000 client.py:935(_send)
320327 2.359 0.000 8.756 0.000 _base.py:287(__init__)