I have been working with the multiprocessing package to speed up some geoprocessing (GIS/arcpy) tasks that are redundant and need to be done the same way for more than 2,000 similar geometries.
The splitting up works well, but my "worker" function is rather long and complicated because the task itself is complicated from start to finish. I would love to break the steps down further, but I am having trouble passing information to and from the worker function because, for some reason, ANYTHING that a worker function under multiprocessing uses needs to be passed in explicitly.
This means I cannot define constants in the body of if __name__ == '__main__' and then use them in the worker function. It also means that my parameter list for the worker function is getting really long - which is super ugly since trying to use more than one parameter also requires creating a helper "star" function and then itertools to rezip them back up (a la the second answer on this question).
I have created a trivial example below that demonstrates what I am talking about. Are there any workarounds for this - a different approach I should be using - or can someone at least explain why this is the way it is?
Note: I am running this on Windows Server 2008 R2 Enterprise x64.
Edit: I seem to have not made my question clear enough. I am not that concerned with how pool.map only takes one argument (although it is annoying) but rather I do not understand why the scope of a function defined outside of if __name__ == '__main__' cannot access things defined inside that block if it is used as a multiprocessing function - unless you explicitly pass it as an argument, which is obnoxious.
import os
import multiprocessing
import itertools

def loop_function(word):
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_function(word, root_dir): # <------ PROBLEM
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_star(arg_package):
    return nonloop_function(*arg_package)

# Serial version
#
# if __name__ == '__main__':
#     root_dir = 'C:\\hbrowning'
#     word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
#     for word in word_list:
#         loop_function(word)
#
## --------------------------------------------
# Multiprocessing version
if __name__ == '__main__':
    root_dir = 'C:\\hbrowning'
    word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
    NUM_CORES = 2
    pool = multiprocessing.Pool(NUM_CORES, maxtasksperchild=1)
    results = pool.map(nonloop_star, itertools.izip(word_list, itertools.repeat(root_dir)),
                       chunksize=1)
    pool.close()
    pool.join()
There is no restriction; the argument only has to be an iterable! Try a Container class, for instance.

The problem is, at least on Windows (although there are similar caveats with the *nix fork style of multiprocessing, too), that when you execute your script it (to greatly simplify) effectively ends up as if you had called two blank (shell) processes with subprocess.Popen() and then had them execute the calls to your worker function one by one, as soon as one of those processes finishes with the previous call. That means that your if __name__ == "__main__" block never gets executed (because it's not the main script, it's imported as a module), so anything declared within it is not readily available to the function (as it was never evaluated).

For the stuff outside your function you can at least cheat by accessing your module via sys.modules["your_script"] or even with globals(), but that works only for the evaluated stuff, so anything that was placed within the if __name__ == "__main__" guard is not available, as it never even had a chance to run. That's also a reason why you must use this guard on Windows: without it you'd be executing your pool creation, and other code that you nested within the guard, over and over again with each spawned process.

If you need to share read-only data in your multiprocessing functions, just define it in the global namespace of your script, outside of that __main__ guard, and all functions will have access to it (as it gets re-evaluated when starting a new process) regardless of whether they are running as separate processes or not.

If you need data that changes then you need to use something that can synchronize itself over different processes. There is a slew of modules designed for that, but most of the time Python's own pickle-based, datagram communicating multiprocessing.Manager (and the types it provides), albeit slow and not very flexible, is enough.
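
A minimal sketch of the read-only case, reusing the word-writing task from the question. SUFFIX is a hypothetical module-level constant, and init_worker, write_word and the temporary output directory are illustrative names that do not appear in the original code. The sketch also uses Pool's initializer and initargs parameters, a technique not mentioned above, as one possible way to hand a value computed inside the guard to every worker without lengthening the worker's parameter list:

```python
import multiprocessing
import os
import tempfile

# Read-only constant at module level: each worker process re-imports this
# module, so the name exists in every worker without being passed in.
SUFFIX = " food"

# Set per process by init_worker(); remains None in any process where the
# initializer has not run.
_root_dir = None


def init_worker(root_dir):
    """Runs once in each worker process and stores a value from the parent."""
    global _root_dir
    _root_dir = root_dir


def write_word(word):
    """Single-argument worker that reads its settings from module globals."""
    file_name = os.path.join(_root_dir, word + ".txt")
    with open(file_name, "w") as text_file:
        text_file.write(word + SUFFIX)
    return file_name


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for 'C:\\hbrowning'.
    root_dir = tempfile.mkdtemp()
    word_list = ["dog", "cat", "llama", "yeti", "parakeet", "dolphin"]

    pool = multiprocessing.Pool(2, initializer=init_worker, initargs=(root_dir,))
    try:
        results = pool.map(write_word, word_list, chunksize=1)
    finally:
        pool.close()
        pool.join()
    print(sorted(os.listdir(root_dir)))
```

Because write_word takes a single argument, pool.map can call it directly, with no star helper or zipped argument tuples. Anything mutable would still need a Manager or similar, as described above.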