I am using Parallel from joblib in my Python code to train a CNN. The code structure is like this:
    from joblib import Parallel, delayed

    crf = CRF()
    # one pool is reused across all epochs
    with Parallel(n_jobs=num_cores) as pal_worker:
        for epoch in range(n):
            temp = pal_worker(delayed(crf.runCRF)(x[i], y[i]) for i in range(m))
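To make the call pattern concrete, here is a minimal self-contained sketch of the same structure. The CRF class below is only a stand-in for my real one, the helper name run_epochs and the array shapes are made up for illustration, and I have not confirmed that this small version reproduces the error.

```python
import numpy as np
from joblib import Parallel, delayed


class CRF:
    """Stand-in for my real CRF class (hypothetical, for illustration only)."""

    def runCRF(self, x_i, y_i):
        # Placeholder computation; the real method runs the CRF on one sample.
        return float(np.sum(x_i * y_i))


def run_epochs(x, y, n_epochs, num_cores):
    """Reuse a single Parallel pool across all epochs, as in my training loop."""
    crf = CRF()
    results = []
    with Parallel(n_jobs=num_cores) as pal_worker:
        for epoch in range(n_epochs):
            # One task per sample; the arguments are pickled and sent to workers.
            temp = pal_worker(
                delayed(crf.runCRF)(x[i], y[i]) for i in range(len(x))
            )
            results.append(temp)
    return results
```

Calling run_epochs(x, y, n_epochs=2, num_cores=2) with two NumPy arrays of the same shape returns one list of per-sample results for each epoch.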
The code runs successfully for 1 or 2 epochs, and then an error occurs. The traceback says (I list only the parts I think matter):
......
File "/data_shared/Docker/tsun/software/anaconda3/envs/pytorch04/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 104, in write_array
pickler.file_handle.write(chunk.tostring('C'))
OSError: [Errno 28] No space left on device
"""
The above exception was the direct cause of the following exception:
return future.result(timeout=timeout)
File
......
_pickle.PicklingError: Could not pickle the task to send it to the workers.
I am confused, since the disk has a lot of free space and the program runs successfully for 1 or 2 epochs. I also tried:
    with Parallel(n_jobs=num_cores, temp_folder='/dev/shm/temp') as pal_worker:
since '/dev/shm/temp' has a lot of space, but that does not work either. Could anyone help, please? Thanks a lot!
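In case it helps with diagnosing this, below is a small sketch for checking how much space is actually free in the folders joblib may write to. As far as I understand from the joblib documentation, when temp_folder is not given it tries the JOBLIB_TEMP_FOLDER environment variable, then /dev/shm, then the system temporary folder. The helper name free_space_report is my own, and the list of candidate folders is an assumption based on that reading.

```python
import os
import shutil
import tempfile


def free_space_report(paths):
    """Map each path to its free bytes, or to None if it is not a directory."""
    report = {}
    for path in paths:
        if os.path.isdir(path):
            report[path] = shutil.disk_usage(path).free
        else:
            report[path] = None
    return report


# Candidate folders (assumed, see the note above).
candidates = ["/dev/shm", tempfile.gettempdir()]
env_folder = os.environ.get("JOBLIB_TEMP_FOLDER")
if env_folder:
    candidates.insert(0, env_folder)

for path, free in free_space_report(candidates).items():
    if free is None:
        print(f"{path}: not a directory")
    else:
        print(f"{path}: {free / 2**30:.2f} GiB free")
```

Running this inside the same environment (for example, the same container) as the training job, between epochs, would show whether one of these folders is filling up.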