I'm using PyTorch not for neural networks, but just for GPU distance-matrix computation. With a single GPU everything works perfectly, but as soon as I move to multiple GPUs an error occurs.
At first I got RuntimeError: CUDA error: initialization error. I googled this error and found a suggested fix: add mp.set_start_method('spawn'). But then a new error occurred, this time ValueError: bad value(s) in fds_to_keep, and I haven't found a way to solve it. Now I'm stuck and don't know how to proceed.
I'm using Ubuntu 16.04, Python 3.6.8 and PyTorch 1.0.0.
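For context, this is roughly the structure of what I'm doing. The function names mirror my density.py / gpr.py, but the bodies below are simplified stand-ins, not my actual code:
import torch
import torch.multiprocessing as mp

def pairwise_distance(data1, data2, device=0):
    # stand-in for gpr.pairwise_distance: squared Euclidean distances on one GPU
    data1, data2 = data1.cuda(device), data2.cuda(device)   # the line that raises for me
    d = (data1 ** 2).sum(1, keepdim=True) + (data2 ** 2).sum(1) - 2.0 * data1 @ data2.t()
    return d.cpu()

def run_kernel(data_i, data_j, device):
    # stand-in for density.run_kernel: each worker should use the GPU given by `device`
    d_ts = pairwise_distance(data_i, data_j, device=device)
    print(device, d_ts.shape)

if __name__ == '__main__':
    # mp.set_start_method('spawn')   # the line I added after googling the first error
    data = torch.randn(4000, 64)
    jobs = []
    for device in range(torch.cuda.device_count()):
        pr = mp.Process(target=run_kernel, args=(data, data, device))
        pr.start()
        jobs.append(pr)
    for job in jobs:
        job.join()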
Without mp.set_start_method('spawn'), the full traceback is:
Traceback (most recent call last):
  File "/home/mc/anaconda3/envs/Lab/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/mc/anaconda3/envs/Lab/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mc/nfs/Project/Polaris_Toolkit/src/density.py", line 288, in run_kernel
    d_ts = gpr.pairwise_distance(data_i, data_j, device=device)
  File "/home/mc/nfs/Project/Polaris_Toolkit/src/gpr.py", line 126, in pairwise_distance
    data1, data2 = data1.cuda(device), data2.cuda(device)
RuntimeError: CUDA error: initialization error
Here device is the index of the GPU.
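As far as I understand, passing an int here is the same as passing an explicit torch.device; a tiny check I ran (using GPU 0 just for illustration):
import torch

x = torch.randn(3, 3)
# both forms should put the tensor on the first GPU (index 0)
a = x.cuda(0)
b = x.cuda(torch.device('cuda:0'))
assert a.device == b.device == torch.device('cuda:0')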
And with mp.set_start_method('spawn'), the traceback is:
~/nfs/Project/Polaris_Toolkit/src/density.py in multi_gpu_density(trans_matrix, mmd, device, workers, threshold_coe, p, batch_size)
438 param = (data, indice, threshold, d, d_ma_row, d_ma_col, p)
439 pr = mp.Process(target=run_kernel, args=param)
--> 440 pr.start()
441 jobs.append(pr)
442 for job in jobs:
~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/process.py in start(self)
103 'daemonic processes are not allowed to have children'
104 _cleanup()
--> 105 self._popen = self._Popen(self)
106 self._sentinel = self._popen.sentinel
107 # Avoid a refcycle if the target function holds an indirect
~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
221 @staticmethod
222 def _Popen(process_obj):
--> 223 return _default_context.get_context().Process._Popen(process_obj)
224
225 class DefaultContext(BaseContext):
~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
282 def _Popen(process_obj):
283 from .popen_spawn_posix import Popen
--> 284 return Popen(process_obj)
285
286 class ForkServerProcess(process.BaseProcess):
~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/popen_spawn_posix.py in __init__(self, process_obj)
30 def __init__(self, process_obj):
31 self._fds = []
---> 32 super().__init__(process_obj)
33
34 def duplicate_for_child(self, fd):
~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/popen_fork.py in __init__(self, process_obj)
17 util._flush_std_streams()
18 self.returncode = None
---> 19 self._launch(process_obj)
20
21 def duplicate_for_child(self, fd):
~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/popen_spawn_posix.py in _launch(self, process_obj)
57 self._fds.extend([child_r, child_w])
58 self.pid = util.spawnv_passfds(spawn.get_executable(),
---> 59 cmd, self._fds)
60 self.sentinel = parent_r
61 with open(parent_w, 'wb', closefd=False) as f:
~/anaconda3/envs/Lab/lib/python3.6/multiprocessing/util.py in spawnv_passfds(path, args, passfds)
415 args, [os.fsencode(path)], True, passfds, None, None,
416 -1, -1, -1, -1, -1, -1, errpipe_read, errpipe_write,
--> 417 False, False, None)
418 finally:
419 os.close(errpipe_read)
ValueError: bad value(s) in fds_to_keep
I now know that with the fork start method CUDA must not be initialized before multiprocessing forks, but I can't understand the error under spawn. I searched for the fds_to_keep keyword in the multiprocessing and torch.multiprocessing source code, but it doesn't appear in either of them. So I'd like to know how to solve this problem, why set_start_method('spawn') doesn't work, what fds_to_keep is, and where this keyword comes from.
Update
I have found the fds_to_keep error in _posixsubprocess.c, and after reading _sanity_check_python_fd_sequence I understood that iter_fd might be what triggers the error. So I went to multiprocessing/util.py and, in the spawnv_passfds function, printed the variable passfds before and after the sort.
original code:
# multiprocessing/util.py
def spawnv_passfds(path, args, passfds):
    import _posixsubprocess
    passfds = tuple(sorted(map(int, passfds)))
    errpipe_read, errpipe_write = os.pipe()
    try:
        return _posixsubprocess.fork_exec(
            args, [os.fsencode(path)], True, passfds, None, None,
            -1, -1, -1, -1, -1, -1, errpipe_read, errpipe_write,
            False, False, None)
    finally:
        os.close(errpipe_read)
        os.close(errpipe_write)
changed code:
def spawnv_passfds(path, args, passfds):
    print('before', passfds)
    import _posixsubprocess
    passfds = tuple(sorted(map(int, passfds)))
    print('after', passfds)
    errpipe_read, errpipe_write = os.pipe()
    ...
passfds didn't change, which I assumed was because it is a tuple and can't be sorted.
Then I changed passfds = tuple(sorted(map(int, passfds))) to passfds = tuple(sorted(map(int, list(passfds)))), and this time the printed passfds values were:
before [77]
after (77,)
before [78, 76, 80, 79]
after (76, 78, 79, 80)
before [78, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 80, 79]
after (75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 78, 79, 80)
It is obvious that there are far too many 75s in the third passfds, so the check iter_fd <= prev_fd is triggered and the error occurs.
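For reference, my reading of that check in _posixsubprocess.c is roughly the following (a Python paraphrase of the C logic, not the actual source):
def sanity_check_fd_sequence(fds_to_keep):
    # rough Python paraphrase of _sanity_check_python_fd_sequence:
    # the fds must be non-negative and strictly increasing (i.e. sorted and unique)
    prev_fd = -1
    for iter_fd in fds_to_keep:
        if iter_fd < 0 or iter_fd <= prev_fd:
            return False   # -> ValueError: bad value(s) in fds_to_keep
        prev_fd = iter_fd
    return True

print(sanity_check_fd_sequence((76, 78, 79, 80)))          # True: this one passes
print(sanity_check_fd_sequence((75, 75, 75, 78, 79, 80)))  # False: duplicates are rejected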
So now my question is: why does this happen, and why is there nothing wrong when using fork (as long as I don't touch any GPU before multiprocessing starts)?
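For completeness, one workaround I'm considering (untested against my full pipeline, and the shapes here are made up): keep the parent process away from CUDA and from tensor arguments entirely, pass plain numpy arrays, and build the CUDA tensors only inside each child:
import numpy as np
import torch
import torch.multiprocessing as mp

def run_kernel(data_i, data_j, device):
    # tensors are created only inside the child process
    t_i = torch.from_numpy(data_i).cuda(device)
    t_j = torch.from_numpy(data_j).cuda(device)
    d = (t_i ** 2).sum(1, keepdim=True) + (t_j ** 2).sum(1) - 2.0 * t_i @ t_j.t()
    print(device, d.shape)

if __name__ == '__main__':
    mp.set_start_method('spawn')
    data = np.random.randn(4000, 64).astype(np.float32)
    jobs = []
    for device in range(torch.cuda.device_count()):
        pr = mp.Process(target=run_kernel, args=(data, data, device))
        pr.start()
        jobs.append(pr)
    for job in jobs:
        job.join()
But even if that works, I'd still like to understand why the duplicated descriptors appear in the first place.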