This question is related to 'PBS job no output when busy', i.e. some of the jobs I submit produce no output when PBS/Torque is busy. I imagine the system is busier when many jobs are submitted one after another, and, as it happens, among jobs submitted in this fashion I often get some that produce no output.
Here is some code.
Suppose I have a python script called "x_analyse.py" that takes as its input a file containing some data, and analyses the data stored in the file:
./x_analyse.py data_1.pkl
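(Here "x_analyse.py" stands for any such analysis script; a minimal placeholder version, with a trivial sum in place of the real analysis, might look like this:)

```python
#!/usr/bin/env python
# Minimal stand-in for x_analyse.py: load the pickled data,
# run a placeholder "analysis", and write the result to a file.
import pickle
import sys

def analyse(infile):
    with open(infile, 'rb') as f:
        data = pickle.load(f)
    result = sum(data)  # placeholder for the real analysis
    outfile = infile.replace('.pkl', '.result')
    with open(outfile, 'w') as f:
        f.write('%s\n' % result)
    return outfile

if __name__ == '__main__':
    analyse(sys.argv[1])
```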
Now, suppose I need to: (1) prepare N such data files: data_1.pkl, data_2.pkl, ..., data_N.pkl; (2) have "x_analyse.py" work on each of them and write the results to a file for each; (3) since the analyses of the different data files are all independent of each other, use PBS/Torque to run them in parallel to save time. (I think this is essentially an 'embarrassingly parallel' problem.)
I have this python script to do the above:
import os
import sys
import time

N = 100

for k in range(1, N + 1):
    datafilename = 'data_%d' % k
    datafile = open(datafilename + '.pkl', 'wb')
    # Prepare data set k, and save it in the file
    datafile.close()

    # Write the PBS submit script for this data set
    jobname = 'analysis_%d' % k
    subfile = open(jobname + '.sub', 'w')
    subfile.writelines(['#!/bin/bash\n',
                        '#PBS -N %s\n' % jobname,
                        '#PBS -o %s\n' % (jobname + '.out'),
                        '#PBS -q compute\n',
                        '#PBS -j oe\n',
                        '#PBS -l nodes=1:ppn=1\n',
                        '#PBS -l walltime=5:00:00\n',
                        'cd $PBS_O_WORKDIR\n',
                        '\n',
                        './x_analyse.py %s\n' % (datafilename + '.pkl')])
    subfile.close()

    # Submit the job, then pause before the next submission
    os.system('qsub %s' % (jobname + '.sub'))
    time.sleep(2.)
The script prepares a data set to be analysed, saves it to a file, writes a PBS submit file for analysing that data set, submits the job, and then moves on to the next data set, and so on.
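(A small variant of the submission step, sketched below with subprocess instead of os.system, also captures the job ID that qsub prints, which makes it easier afterwards to match missing output files back to specific job IDs; the helper name is mine.)

```python
# Sketch: capture the job ID printed by qsub at submission time.
import subprocess

def submit(subfilename, qsub_cmd='qsub'):
    """Submit a job script and return the job ID qsub prints,
    e.g. '1184430.mgt1'."""
    out = subprocess.check_output([qsub_cmd, subfilename],
                                  universal_newlines=True)
    return out.strip()

# e.g. job_ids = [submit('analysis_%d.sub' % k) for k in range(1, N + 1)]
```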
As it is, when the script is run, the job IDs are printed to standard output as the jobs are submitted. 'ls' shows that there are N .sub files and N .pkl data files. 'qstat' shows that all the jobs run with status 'R' and then complete with status 'C'. However, afterwards 'ls' shows fewer than N .out output files, and fewer than N result files produced by "x_analyse.py"; in effect, some of the jobs produce no output. If I clear everything and re-run the script, I get the same behaviour, with some jobs (but not necessarily the same ones as last time) producing no output.
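(Rather than eyeballing 'ls', a quick way to list exactly which jobs produced no output file, assuming the naming scheme above, is:)

```python
import os

def missing_outputs(N, pattern='analysis_%d.out'):
    """Return the indices k for which no output file exists."""
    return [k for k in range(1, N + 1)
            if not os.path.exists(pattern % k)]

# e.g. print(missing_outputs(100))
```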
It has been suggested that increasing the waiting time between consecutive job submissions improves things:
time.sleep(10.) #or some other waiting time
But I feel this is not the most satisfactory solution, because I have tried 0.1 s, 0.5 s, 1.0 s, 2.0 s and 3.0 s, none of which really helped. I have been told that a 50 s wait works fine, but if I have to submit 100 jobs, the total waiting time will be about 5000 s, which seems awfully long.
I have tried reducing the number of times 'qsub' is used by submitting a job array instead. I would prepare all the data files as before, but only have one submit file, "analyse_all.sub":
#!/bin/bash
#PBS -N analyse
#PBS -o analyse.out
#PBS -q compute
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -l walltime=5:00:00
cd $PBS_O_WORKDIR
./x_analyse.py data_$PBS_ARRAYID.pkl
and then submit with
qsub -t 1-100 analyse_all.sub
But even with this, some jobs still do not produce output.
Is this a common problem? Am I doing something wrong? Is waiting between job submissions the best solution? Can I do something to improve this?
Thanks in advance for any help.
Edit 1:
I'm using Torque version 2.4.7 and Maui version 3.3.
Also, suppose the job with ID 1184430.mgt1 produced no output and the job with ID 1184431.mgt1 produced output as expected. Running 'tracejob' on these gives the following:
[batman@gotham tmp]$tracejob 1184430.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory
Job: 1184430.mgt1
12/13/2012 13:53:13 S enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13 S Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_1, queue = compute
12/13/2012 13:53:13 S Job Run at request of root@mgt1
12/13/2012 13:53:13 S Not sending email: User does not want mail of this type.
12/13/2012 13:54:48 S Not sending email: User does not want mail of this type.
12/13/2012 13:54:48 S Exit_status=135 resources_used.cput=00:00:00 resources_used.mem=15596kb resources_used.vmem=150200kb resources_used.walltime=00:01:35
12/13/2012 13:54:53 S Post job file processing error
12/13/2012 13:54:53 S Email 'o' to batman@mgt1 failed: Child process '/usr/lib/sendmail -f adm batman@mgt1' returned 67 (errno 10:No child processes)
[batman@gotham tmp]$tracejob 1184431.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory
Job: 1184431.mgt1
12/13/2012 13:53:13 S enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13 S Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_2, queue = compute
12/13/2012 13:53:13 S Job Run at request of root@mgt1
12/13/2012 13:53:13 S Not sending email: User does not want mail of this type.
12/13/2012 13:53:31 S Not sending email: User does not want mail of this type.
12/13/2012 13:53:31 S Exit_status=0 resources_used.cput=00:00:16 resources_used.mem=19804kb resources_used.vmem=154364kb resources_used.walltime=00:00:18
Edit 2: For job that produces no output, 'qstat -f' returns the following:
[batman@gotham tmp]$qstat -f 1184673.mgt1
Job Id: 1184673.mgt1
Job_Name = analysis_7
Job_Owner = batman@mgt1
resources_used.cput = 00:00:16
resources_used.mem = 17572kb
resources_used.vmem = 152020kb
resources_used.walltime = 00:01:36
job_state = C
queue = compute
server = mgt1
Checkpoint = u
ctime = Fri Dec 14 14:00:31 2012
Error_Path = mgt1:/gpfs1/batman/tmp/analysis_7.e1184673
exec_host = node26/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Dec 14 14:02:07 2012
Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_7.out
Priority = 0
qtime = Fri Dec 14 14:00:31 2012
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 05:00:00
session_id = 9397
Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8, PBS_O_LOGNAME=batman,
PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/instal
ls/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sb
in:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bi
n:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
sched_hint = Post job file processing error; job 1184673.mgt1 on host node
26/0Unknown resource type REJHOST=node26 MSG=invalid home directory '
/gpfs1/batman' specified, errno=116 (Stale NFS file handle)
etime = Fri Dec 14 14:00:31 2012
exit_status = 135
submit_args = analysis_7.sub
start_time = Fri Dec 14 14:00:31 2012
Walltime.Remaining = 1790
start_count = 1
fault_tolerant = False
comp_time = Fri Dec 14 14:02:07 2012
as compared with a job that produces output:
[batman@gotham tmp]$qstat -f 1184687.mgt1
Job Id: 1184687.mgt1
Job_Name = analysis_1
Job_Owner = batman@mgt1
resources_used.cput = 00:00:16
resources_used.mem = 19652kb
resources_used.vmem = 162356kb
resources_used.walltime = 00:02:38
job_state = C
queue = compute
server = mgt1
Checkpoint = u
ctime = Fri Dec 14 14:40:46 2012
Error_Path = mgt1:/gpfs1/batman/tmp/analysis_1.e118468
7
exec_host = ionode2/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Dec 14 14:43:24 2012
Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_1.out
Priority = 0
qtime = Fri Dec 14 14:40:46 2012
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 05:00:00
session_id = 28039
Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=batman,
PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/instal
ls/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sb
in:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bi
n:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
etime = Fri Dec 14 14:40:46 2012
exit_status = 0
submit_args = analysis_1.sub
start_time = Fri Dec 14 14:40:47 2012
Walltime.Remaining = 1784
start_count = 1
It appears that the exit status is 0 for one job but 135 for the other.
Edit 3:
From 'qstat -f' outputs like the ones above, it seems that the problem has something to do with a 'Stale NFS file handle' error in the post-job file processing. By submitting hundreds of test jobs, I have been able to identify a number of nodes that produce failed jobs. By ssh-ing onto these, I can find the missing PBS output files in /var/spool/torque/spool, where I can also see output files belonging to other users. One strange thing about these problematic nodes is that if one of them is the only node used, the job runs fine on it. The problem only arises when they are mixed with other nodes.
Since I do not know how to fix the 'Stale NFS file handle' error in the post-job file processing, I avoid these nodes by submitting 'dummy' jobs to occupy them
echo sleep 60 | qsub -lnodes=badnode1:ppn=2+badnode2:ppn=2
before submitting the real jobs. Now all jobs produce output as expected, and there is no need to wait between consecutive submissions.
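(The workaround can be scripted; a sketch, where the node names and the helper are placeholders of mine and the list of bad nodes is whatever the test jobs identified:)

```python
import os

def block_bad_nodes(bad_nodes, ppn=2, submit=os.system):
    """Submit one sleeping dummy job that occupies every listed bad
    node, so the scheduler cannot place real jobs on them."""
    spec = '+'.join('%s:ppn=%d' % (n, ppn) for n in bad_nodes)
    return submit('echo sleep 60 | qsub -l nodes=%s' % spec)

# e.g. block_bad_nodes(['badnode1', 'badnode2'])
```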