PyCuda / Multiprocessing Issue on OS X 10.8

I'm working on a project where I distribute compute tasks to multiple Python processes, each associated with its own CUDA device.

When spawning the subprocesses, I use the following code:

import pycuda.driver as cuda
from worker import CudaWorker  # defined in worker.py; assumes this file sits next to it

class ComputeServer(object):
    def _init_workers(self):
        self.workers = []
        cuda.init()
        for device_id in range(cuda.Device.count()):
            print "initializing device {}".format(device_id)
            worker = CudaWorker(device_id)
            worker.start()
            self.workers.append(worker)

The CudaWorker is defined in another file as follows:

from multiprocessing import Process
import pycuda.driver as cuda

class CudaWorker(Process):
    def __init__(self, device_id):
        Process.__init__(self)
        self.device_id = device_id

    def run(self):
        self._init_cuda_context()
        while True:
            # process requests here
            pass

    def _init_cuda_context(self):
        # the following line fails
        cuda.init()
        device = cuda.Device(self.device_id)
        self.cuda_context = device.make_context()

When I run this code on Windows 7 or Linux, I have no issues. When I run it on my MacBook Pro with OS X 10.8.2, CUDA 5.0, and PyCUDA 2012.1, I get the following error:

Process CudaWorker-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/tombnorwood/pymodules/computeserver/worker.py", line 32, in run
    self._init_cuda_context()
  File "/Users/tombnorwood/pymodules/computeserver/worker.py", line 38, in _init_cuda_context
    cuda.init()
RuntimeError: cuInit failed: no device

I have no issues running PyCuda scripts without forking new processes on my Mac. I only get this issue when spawning a new Process.

Has anyone run into this issue before?

This is really just an educated guess based on my experience, but I suspect that the OS X implementation of CUDA (or possibly PyCUDA) relies on some APIs that can't be used safely after fork, while the Linux implementation does not.* Since the POSIX implementation of multiprocessing uses fork without exec to create child processes, this would explain why it fails on OS X but not Linux. (And on Windows, there is no fork, just a spawn equivalent, so this isn't an issue.)
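To make "fork without exec" concrete, here is a minimal POSIX-only sketch. The `state` dict is just a stand-in for whatever a driver sets up during initialization; it is not a claim about what CUDA actually stores. The function name `fork_and_report` is made up for illustration.

```python
import os

def fork_and_report():
    """Fork without exec; return (pid that did the init, pid of the child)."""
    # Stand-in for driver state created by an init call in the parent.
    state = {"initialized_in": os.getpid()}
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it got a copy of the parent's memory and never ran the
        # init itself, yet it still sees the parent's "initialized" state.
        try:
            os.close(read_fd)
            message = "%d %d" % (state["initialized_in"], os.getpid())
            os.write(write_fd, message.encode("ascii"))
        finally:
            os._exit(0)  # leave without running the parent's cleanup code
    # Parent: read what the child observed, then reap it.
    os.close(write_fd)
    data = os.read(read_fd, 64).decode("ascii")
    os.close(read_fd)
    os.waitpid(pid, 0)
    init_pid, child_pid = [int(part) for part in data.split()]
    return init_pid, child_pid
```

The child reports the parent's pid as the one that did the initialization. A library holding handles tied to the parent process could end up in the same position in the forked child, which is the kind of situation the guess above is about.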

The simplest solution would be to drop multiprocessing. If CUDA and PyCUDA are thread-safe (I don't know if they are), and your code is not CPU-bound (just GPU-bound), you might be able to just drop in threading.Thread in place of multiprocessing.Process and be done with it. Or you could consider one of the other parallel-processing libraries that provide similar APIs to multiprocessing. (There are a few people who use pp only because it always execs…)
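A rough sketch of the threading variant, assuming (unverified, as noted above) that the CUDA calls you need are safe to make from worker threads. `ThreadWorker` and the `make_context` callable are hypothetical names; with PyCUDA you would pass something along the lines of `lambda device_id: cuda.Device(device_id).make_context()` after calling `cuda.init()` once.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

class ThreadWorker(threading.Thread):
    """Same start()/run() shape as the Process-based worker, but a thread."""

    def __init__(self, device_id, make_context):
        threading.Thread.__init__(self)
        self.daemon = True
        self.device_id = device_id
        self.make_context = make_context   # hypothetical context factory
        self.requests = queue.Queue()
        self.results = queue.Queue()

    def run(self):
        # Create the per-device context inside the worker thread itself.
        context = self.make_context(self.device_id)
        while True:
            request = self.requests.get()
            if request is None:            # None is the shutdown signal
                break
            # Placeholder for real work: echo the request with its context.
            self.results.put((context, request))
```

Since nothing is forked, there is no post-fork state to worry about, but the GIL means this only helps if the Python side is not CPU-bound.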

However, it's pretty easy to hack up multiprocessing to exec/spawn a new Python interpreter and then do everything Windows-style instead of POSIX-style. (Getting every case right is difficult, but getting one specific use case right is easy.)
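For one specific use case, the crudest version of this is to skip multiprocessing and start a fresh interpreter per device with subprocess. This is only a sketch: `spawn_worker`, `run_demo`, and the inline `DEMO_ARGS` worker body are made-up stand-ins. In real code you would pass the path to your own worker script, which would call `cuda.init()` and create its context itself.

```python
import subprocess
import sys

def spawn_worker(device_id, script_args):
    """Start a brand-new Python interpreter for one device.

    Because the child is exec'd rather than forked from a running
    interpreter, it inherits no already-initialized library state.
    """
    command = [sys.executable] + list(script_args) + [str(device_id)]
    return subprocess.Popen(command, stdout=subprocess.PIPE)

# Stand-in worker body; a real one would live in its own script file.
DEMO_ARGS = ["-c", "import sys; print('worker for device ' + sys.argv[1])"]

def run_demo(device_ids):
    """Spawn one stand-in worker per device and collect what each prints."""
    workers = [spawn_worker(device_id, DEMO_ARGS) for device_id in device_ids]
    return [w.communicate()[0].decode().strip() for w in workers]
```

What you lose compared to multiprocessing is the built-in plumbing: you have to pass requests and results over pipes, sockets, or files yourself.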

Or, if you look at Python bug #8713, there's some work being done on making this work right in general, and there are working patches. Those patches are for 3.3, not 2.7, so you'd probably need a bit of massaging, but it shouldn't be very much. Note that multiprocessing is a package, not a single file, so copy the whole directory: cp -R $MY_PYTHON_LIB/multiprocessing $MY_PROJECT_DIR/mymultiprocessing. Then patch it, use mymultiprocessing in place of multiprocessing, and add the appropriate call to pick spawn/fork+exec/whatever the mode is called in the latest patch before you do anything else.
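For future readers on Python 3.4 or later: that work ended up in the standard library as selectable start methods, so no patching is needed there. A minimal sketch, with `worker_main` and `make_workers` as hypothetical names and the CUDA setup left as a comment:

```python
import multiprocessing

def worker_main(device_id, results):
    # Real code would call cuda.init() and create the context here, inside
    # the freshly spawned interpreter, before handling any requests.
    results.put(device_id)

def make_workers(device_ids, method="spawn"):
    """Build (but do not start) one process per device using `method`."""
    ctx = multiprocessing.get_context(method)   # Python 3.4+
    results = ctx.Queue()
    workers = [ctx.Process(target=worker_main, args=(device_id, results))
               for device_id in device_ids]
    return workers, results
```

Call `start()` on each worker from code guarded by `if __name__ == "__main__":`, because the spawn method re-imports the main module in every child. None of this exists on 2.7, where the copy-and-patch route above still applies.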

* The OP says he suspected the same thing, so I probably don't need to explain this to him, but for future readers: It's not about a difference between Darwin and other Unixes, but about the fact that Apple ships a lot of non-Unix-y mid-level libraries like CoreFoundation.framework, Accelerate.framework, etc. that use unsafe-after-fork functionality (or just assert that they're not being used after a fork, because Apple doesn't want to put in the rigorous testing that would be warranted before they could say "as of 10.X, Foo.framework is safe after fork"). Also, if you compare the way OS X and Linux deal with graphics and other hardware, there's a lot more mid-level in-each-process-userspace going on in OS X.