Why can't I use PyTorch for multi-GPU CUDA calculation?

Posted 2019-07-19 05:49

I'm using PyTorch not for neural networks, but just to compute a distance matrix on the GPU. With a single GPU everything works perfectly, but when I move to multiple GPUs an error occurs.

At first I got RuntimeError: CUDA error: initialization error. I googled the error and found a suggested fix: add mp.set_start_method('spawn'). But then a new error occurred, this time ValueError: bad value(s) in fds_to_keep, and I've found no way past it.

Now I'm confused and don't know how to solve it.

I'm using Ubuntu 16.04, Python 3.6.8, and PyTorch 1.0.0.
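For context, here is a minimal sketch of the pattern I'm using. The function names match the tracebacks below, but the bodies are simplified stand-ins rather than my exact code:

import torch
import torch.multiprocessing as mp

def pairwise_distance(data1, data2, device):
    # Move both batches onto the target GPU, then build an (n, m)
    # Euclidean distance matrix by broadcasting.
    data1, data2 = data1.cuda(device), data2.cuda(device)
    diff = data1.unsqueeze(1) - data2.unsqueeze(0)
    return diff.pow(2).sum(-1).sqrt()

def run_kernel(data_i, data_j, device):
    d_ts = pairwise_distance(data_i, data_j, device=device)
    # ... further per-GPU processing ...

if __name__ == '__main__':
    data = torch.randn(4096, 128)
    jobs = []
    for device in range(torch.cuda.device_count()):
        pr = mp.Process(target=run_kernel, args=(data, data, device))
        pr.start()
        jobs.append(pr)
    for job in jobs:
        job.join()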

Without mp.set_start_method('spawn'), the full traceback is:

Traceback (most recent call last):
  File "/home/mc/anaconda3/envs/Lab/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/mc/anaconda3/envs/Lab/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mc/nfs/Project/Polaris_Toolkit/src/density.py", line 288, in run_kernel
    d_ts = gpr.pairwise_distance(data_i, data_j, device=device)
  File "/home/mc/nfs/Project/Polaris_Toolkit/src/gpr.py", line 126, in pairwise_distance
    data1, data2 = data1.cuda(device), data2.cuda(device)
RuntimeError: CUDA error: initialization error

Here device is the index of the GPU to use.

And with mp.set_start_method('spawn'), the traceback is:

~/nfs/Project/Polaris_Toolkit/src/density.py in multi_gpu_density(trans_matrix, mmd, device, workers, threshold_coe, p, batch_size)
    438             param = (data, indice, threshold, d, d_ma_row, d_ma_col, p)
    439             pr = mp.Process(target=run_kernel, args=param)
--> 440             pr.start()
    441             jobs.append(pr)
    442         for job in jobs:

~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/process.py in start(self)
    103                'daemonic processes are not allowed to have children'
    104         _cleanup()
--> 105         self._popen = self._Popen(self)
    106         self._sentinel = self._popen.sentinel
    107         # Avoid a refcycle if the target function holds an indirect

~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
    282         def _Popen(process_obj):
    283             from .popen_spawn_posix import Popen
--> 284             return Popen(process_obj)
    285 
    286     class ForkServerProcess(process.BaseProcess):

~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/popen_spawn_posix.py in __init__(self, process_obj)
     30     def __init__(self, process_obj):
     31         self._fds = []
---> 32         super().__init__(process_obj)
     33 
     34     def duplicate_for_child(self, fd):

~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/popen_fork.py in __init__(self, process_obj)
     17         util._flush_std_streams()
     18         self.returncode = None
---> 19         self._launch(process_obj)
     20 
     21     def duplicate_for_child(self, fd):

~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/popen_spawn_posix.py in _launch(self, process_obj)
     57             self._fds.extend([child_r, child_w])
     58             self.pid = util.spawnv_passfds(spawn.get_executable(),
---> 59                                            cmd, self._fds)
     60             self.sentinel = parent_r
     61             with open(parent_w, 'wb', closefd=False) as f:

~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/util.py in spawnv_passfds(path, args, passfds)
    415             args, [os.fsencode(path)], True, passfds, None, None,
    416             -1, -1, -1, -1, -1, -1, errpipe_read, errpipe_write,
--> 417             False, False, None)
    418     finally:
    419         os.close(errpipe_read)

ValueError: bad value(s) in fds_to_keep

I now know that with the fork start method, CUDA must not be initialized before multiprocessing, but I can't understand the error under spawn. I tried searching for the keyword fds_to_keep in the multiprocessing and torch.multiprocessing source code, but it appears in neither.
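A minimal illustration of the fork restriction, as I understand it (hypothetical code, not my project):

import torch
import torch.multiprocessing as mp

def worker(device):
    # In a fork-ed child this raises
    # "RuntimeError: CUDA error: initialization error",
    # because the child inherits the parent's already-initialized CUDA state.
    torch.zeros(8).cuda(device)

if __name__ == '__main__':
    torch.zeros(8).cuda(0)                    # parent initializes CUDA
    p = mp.Process(target=worker, args=(0,))  # 'fork' is the default start method on Linux
    p.start()
    p.join()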

So I'd like to know how to solve this problem: why doesn't set_start_method('spawn') work, what is fds_to_keep, and where does this keyword live?
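For reference, this is roughly where I call set_start_method (simplified), in case the placement matters:

import torch.multiprocessing as mp

def main():
    ...  # build the args and start the run_kernel processes as above

if __name__ == '__main__':
    # Must run once, before any Process is created; 'spawn' launches
    # children from a fresh interpreter instead of fork-ing the parent.
    mp.set_start_method('spawn')
    main()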

Update:

I have found the fds_to_keep error in _posixsubprocess.c. Looking at _sanity_check_python_fd_sequence, I realized that the iter_fd check might be what raises the error. So I went to the spawnv_passfds function in multiprocessing/util.py and printed the variable passfds before and after the sort.
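For readers who don't want to open the C file: as I read it, the check in _sanity_check_python_fd_sequence boils down to something like this Python paraphrase (mine, not the actual CPython source):

def sanity_check_fd_sequence(fds):
    # fds_to_keep must be a strictly increasing sequence of non-negative
    # ints; a duplicate or unsorted entry (iter_fd <= prev_fd) makes
    # fork_exec raise "ValueError: bad value(s) in fds_to_keep".
    prev_fd = -1
    for iter_fd in fds:
        if iter_fd < 0 or iter_fd <= prev_fd:
            return False
        prev_fd = iter_fd
    return True

print(sanity_check_fd_sequence((76, 78, 79, 80)))      # True
print(sanity_check_fd_sequence((75, 75, 78, 79, 80)))  # False: duplicates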

Original code:

# multiprocessing/util.py

def spawnv_passfds(path, args, passfds):
    import _posixsubprocess
    passfds = tuple(sorted(map(int, passfds)))
    errpipe_read, errpipe_write = os.pipe()
    try:
        return _posixsubprocess.fork_exec(
            args, [os.fsencode(path)], True, passfds, None, None,
            -1, -1, -1, -1, -1, -1, errpipe_read, errpipe_write,
            False, False, None)
    finally:
        os.close(errpipe_read)
        os.close(errpipe_write)

Changed code:

def spawnv_passfds(path, args, passfds):
    print('before', passfds)
    import _posixsubprocess
    passfds = tuple(sorted(map(int, passfds)))
    print('after', passfds)
    errpipe_read, errpipe_write = os.pipe()
    ...

passfds didn't change between the two prints; I took this to mean the tuple wasn't actually being sorted.

Then I changed passfds = tuple(sorted(map(int, passfds))) to passfds = tuple(sorted(map(int, list(passfds)))), and this time passfds printed as:

before [77]
after (77,)
before [78, 76, 80, 79]
after (76, 78, 79, 80)
before [78, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 80, 79]
after (75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 78, 79, 80)

It is obvious that there are many duplicate 75s in the third passfds, so the sanity check hits iter_fd <= prev_fd and the error occurs.

So now my question is: why do these duplicate file descriptors appear, and why is nothing wrong under fork (as long as I don't touch any GPU before multiprocessing)?
