I'm trying to learn Python multiprocessing.
I'm working from the example in http://docs.python.org/2/library/multiprocessing.html that is introduced with "To show the individual process IDs involved, here is an expanded example:"

    from multiprocessing import Process
    import os

    def info(title):
        print title
        print 'module name:', __name__
        if hasattr(os, 'getppid'):  # only available on Unix
            print 'parent process:', os.getppid()
        print 'process id:', os.getpid()

    def f(name):
        info('function f')
        print 'hello', name

    if __name__ == '__main__':
        info('main line')
        p = Process(target=f, args=('bob',))
        p.start()
        p.join()

What exactly am I looking at? I see that f(name) is called after info('main line') has finished, but that sequential order would be the default anyway. I also see that the process that runs info('main line') is the parent of the process that runs f(name), but I'm not sure what is 'multiprocessing' about that.
Also, the documentation for join() says: "Block the calling thread until the process whose join() method is called terminates". I'm not clear on what the calling thread would be. In this example, what would join() be blocking?

How multiprocessing works, in a nutshell:

- Process() spawns (fork or similar on Unix-like systems) a copy of the original program (on Windows, which lacks a real fork, this is tricky and requires the special care that the module documentation notes).
- The copy then goes off and invokes the target= function (see below).

Since these are independent processes, they now have independent Global Interpreter Locks (in CPython), so both can use up to 100% of a CPU on a multi-CPU box, as long as they don't contend for other lower-level (OS) resources. That's the "multiprocessing" part.
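
To make the "copy" and "independent" points concrete, here is a minimal sketch (not from the documentation). The names counter and bump_and_report are made up for illustration, and it calls print() as a function so it should run on Python 3 as well as Python 2. The child increments its own copy of counter, so the parent's value should be unchanged after join():

```python
from __future__ import print_function

import os
from multiprocessing import Process

# Hypothetical module-level state; every process gets its own copy of it.
counter = 0


def bump_and_report(label):
    """Increment this process's copy of counter and print who did it."""
    global counter
    counter += 1
    print(label, 'pid', os.getpid(), 'counter', counter)
    return counter


if __name__ == '__main__':
    p = Process(target=bump_and_report, args=('child: ',))
    p.start()
    p.join()  # the parent waits here until the child has exited
    # The child changed only its own copy, so the parent still sees 0.
    print('parent:', 'pid', os.getpid(), 'counter', counter)
```

Run as a script, this should show two different PIDs, with the child's counter at 1 and the parent's still at 0.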

Of course, at some point you have to send data back and forth between these supposedly-independent processes, e.g., to send results from one (or many) worker process(es) back to a "main" process. (There is the occasional exception where everyone's completely independent, but it's rare ... plus there's the whole start-up sequence itself, kicked off by p.start().) So each created Process instance (p, in the above example) has a communications channel to its parent creator and vice versa (it's a symmetric connection). The multiprocessing module uses the pickle module to turn data into strings (the same strings you can stash in files with pickle.dump) and sends the data across the channel, "downwards" to workers to send arguments and such, and "upwards" from workers to send back results.

Eventually, once you're all done with getting results, the worker finishes (by returning from the target= function) and tells the parent it's done. To make sure everything gets closed and cleaned up, the parent should call p.join() to wait for the worker's "I'm done" message (actually an OS-level exit on Unix-ish systems).

The example is a little bit silly since the two printed messages take basically no time at all, so running them "at the same time" has no measurable gain. But suppose instead of just printing hello, f were to calculate the first 100,000 digits of π (3.14159...). You could then spawn another Process, p2, with a different target g that calculates the first 100,000 digits of e (2.71828...). These would run independently. The parent could then call p.join() and p2.join() to wait for both to complete (or spawn yet more workers to do more work and occupy more CPUs, or even go off and do its own work for a while first).