I'm trying to run a very large set of batch jobs on a RHEL5 cluster which uses a Lustre file system. I was getting a strange error with roughly 1% of the jobs: they couldn't find a text file that all of the jobs use for steering. A script that reproduces the error looks like this:
#!/usr/bin/env bash
#PBS -t 1-18792
#PBS -l mem=4gb,walltime=30:00
#PBS -l nodes=1:ppn=1
#PBS -q hep
#PBS -o output/fit/out.txt
#PBS -e output/fit/error.txt
cd $PBS_O_WORKDIR
mkdir -p output/fit
echo 'submitted from: ' $PBS_O_WORKDIR
files=($(ls ./*.txt | sort)) # <-- NOTE THIS LINE
cat batch/fits/fit-paths.txt
For some small fraction of jobs, the error stream output would show:
cat: batch/fits/fit-paths.txt: No such file or directory
Weird enough, but it gets stranger.
When I change the line
files=($(ls ./*.txt | sort))
to
files=($(ls batch/fits/*.txt | sort))
The jobs run without errors! Needless to say, this is far from satisfying: I'd rather not have my jobs depend on black magic (although black magic is better than no magic).
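To help narrow this down, here is a minimal diagnostic sketch. It assumes (and this is only a guess, not something confirmed) that the file is transiently invisible on some nodes rather than truly missing. The helper name wait_for_file and its retry count and delay arguments are hypothetical; it simply polls for the file a few times and logs the host name on each attempt, so the failing nodes and the time it takes for the file to appear (if it ever does) show up in the error stream.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper (not part of the original job script).
# Usage: wait_for_file PATH [TRIES] [DELAY_SECONDS]
# Polls until PATH is readable, logging each attempt and the host to stderr.
wait_for_file() {
    local path=$1 tries=${2:-5} delay=${3:-1} i
    for ((i = 1; i <= tries; i++)); do
        if [[ -r $path ]]; then
            echo "found $path on attempt $i (host: $(hostname))" >&2
            return 0
        fi
        echo "attempt $i: $path not visible (host: $(hostname))" >&2
        sleep "$delay"
    done
    # Still not readable after all attempts.
    return 1
}
```

In the job script, the final line could then become something like `wait_for_file batch/fits/fit-paths.txt && cat batch/fits/fit-paths.txt`. If the file shows up on a later attempt, that would point toward a caching or metadata delay on the client side; if it never shows up, something else is going on.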
Any idea what's going on here?