I have been working with the multiprocessing package to speed up some geoprocessing (GIS/arcpy) tasks that are repetitive and need to be done identically for more than 2,000 similar geometries.
The splitting up works well, but my "worker" function is rather long and complicated, because the task itself from start to finish is complicated. I would love to break the steps down further, but I am having trouble passing information to and from the worker function, because for some reason ANYTHING that a worker function under multiprocessing uses needs to be passed in explicitly.
This means I cannot define constants in the body of if __name__ == '__main__' and then use them in the worker function. It also means that the parameter list for the worker function is getting really long, which is super ugly, since using more than one parameter also requires creating a helper "star" function and then itertools to zip the arguments back up (a la the second answer on this question).
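For illustration, here is a stripped-down sketch of that pattern (fake_worker, fake_worker_star, and the argument names are just placeholders, not my real geoprocessing code); every additional constant means another itertools.repeat and another slot in the tuple:

import os
import itertools
import multiprocessing

# stand-in for my real (much longer) worker: word varies per task, the other
# two arguments are constants that only ride along because nothing else is in scope
def fake_worker(word, root_dir, suffix):
    return os.path.join(root_dir, word + suffix)

def fake_worker_star(arg_package):
    # unpack the zipped-up tuple back into real arguments
    return fake_worker(*arg_package)

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    results = pool.map(fake_worker_star,
                       itertools.izip(['dog', 'cat', 'llama'],
                                      itertools.repeat('C:\\hbrowning'),
                                      itertools.repeat('.txt')))
    pool.close()
    pool.join()

Multiply that by however many parameters my real worker needs and it gets unwieldy fast.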
I have created a trivial example below that demonstrates what I am talking about. Are there any workarounds for this - a different approach I should be using - or can someone at least explain why this is the way it is?
Note: I am running this on Windows Server 2008 R2 Enterprise x64.
Edit: I seem not to have made my question clear enough. I am not that concerned with the fact that pool.map only takes one argument (although it is annoying), but rather I do not understand why a function defined outside of if __name__ == '__main__' cannot access things defined inside that block when it is used as a multiprocessing worker function, unless you explicitly pass them in as arguments, which is obnoxious.
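To spell out what I mean, here is a boiled-down version of the full example at the bottom, using the same loop_function: the serial loop works, but the pool call does not.

import os
import multiprocessing

def loop_function(word):
    # root_dir is only ever assigned inside the __main__ block below, and for
    # whatever reason the pool workers cannot see it
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

if __name__ == '__main__':
    root_dir = 'C:\\hbrowning'
    word_list = ['dog', 'cat', 'llama']

    # run serially this is fine, because root_dir exists as a global by now
    for word in word_list:
        loop_function(word)

    # but handed to the pool, every task dies (on Windows, at least) with a
    # NameError complaining that root_dir is not defined
    pool = multiprocessing.Pool(2)
    pool.map(loop_function, word_list)
    pool.close()
    pool.join()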
import os
import multiprocessing
import itertools

def loop_function(word):
    # relies on root_dir simply being in scope; this is what I want to write
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_function(word, root_dir): # <------ PROBLEM
    # identical to loop_function, except root_dir has to be passed in explicitly
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_star(arg_package):
    # helper needed because pool.map only feeds a single argument to the worker
    return nonloop_function(*arg_package)
# Serial version
#
# if __name__ == '__main__':
#     root_dir = 'C:\\hbrowning'
#     word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
#     for word in word_list:
#         loop_function(word)
#
## --------------------------------------------
# Multiprocessing version
if __name__ == '__main__':
    root_dir = 'C:\\hbrowning'
    word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']

    NUM_CORES = 2
    pool = multiprocessing.Pool(NUM_CORES, maxtasksperchild=1)

    results = pool.map(nonloop_star,
                       itertools.izip(word_list, itertools.repeat(root_dir)),
                       chunksize=1)

    pool.close()
    pool.join()