SGE submitted job doesn't run

Posted 2019-07-23 03:13

I'm using Sun Grid Engine (SGE) on Ubuntu 14.04 to queue jobs to run on my multicore CPU. I've installed and set up SGE on my system, but I have a problem when testing it. I've created a "hello_world" directory containing two shell scripts, "hello_world.sh" and "hello_world_qsub.sh". The first contains a simple command; the second contains the qsub command that submits the first script as a job. Here's what "hello_world.sh" contains:

#!/bin/bash

echo "Hello world" > /home/theodore/tmp/hello_world/hello_world_output.txt

And here's what "hello_world_qsub.sh" contains:

#!/bin/bash

qsub \
  -e /home/hello_world/hello_world_qsub.error \
  -o /home/hello_world/hello_world_qsub.log \
  ./hello_world.sh

After making the second script executable and running it with "./hello_world_qsub.sh" from that directory, the output looks reasonable:

Your job 1 ("hello_world.sh") has been submitted

But the output of the "qstat" command is frustrating:

    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
    -----------------------------------------------------------------------------------------------------------------
     1 0.50000 hello_worl mhr          qw    05/16/2016 20:26:23                                    1        

The "state" column always stays at "qw" (queued, waiting) and never changes to "r" (running).

Here's the output of the "qstat -j 1" command:

==============================================================
job_number:                 1
exec_file:                  job_scripts/1
submission_time:            Mon May 16 20:26:23 2016
owner:                      mhr
uid:                        1000
group:                      mhr
gid:                        1000
sge_o_home:                 /home/mhr
sge_o_log_name:             mhr
sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/mhr/hello_world
sge_o_host:                 localhost
account:                    sge
stderr_path_list:           NONE:NONE:/home/hello_world/hello_world_qsub.error
mail_list:                  mhr@localhost
notify:                     FALSE
job_name:                   hello_world.sh
stdout_path_list:           NONE:NONE:/home/hello_world/hello_world_qsub.log
jobshare:                   0
env_list:                   
script_file:                ./hello_world.sh
scheduling info:            queue instance "mainqueue@localhost" dropped because it is temporarily not available
                        All queues dropped because of overload or full

And here's the output of the "qhost" command:

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
localhost               -               -     -       -       -       -       -

What should I do to get my jobs to run and finish?

2 answers
We Are One
#2 · 2019-07-23 03:22

From your qhost output, it looks like your machine "localhost" is registered as a host in SGE. However, sge_execd on "localhost" is either not running or not configured properly. If it were, qhost would report statistics (ARCH, NCPU, LOAD, and so on) for "localhost" instead of "-" in every column.
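
As a rough way to spot this condition, here is a minimal sketch that filters qhost output for hosts with no reported statistics. The function name hosts_without_stats is made up for illustration, and the sketch assumes the column layout shown in the question (two header lines, then one line per host).

```shell
#!/bin/bash
# Hypothetical helper: read `qhost` output on stdin and print every host
# whose ARCH and LOAD columns are "-", meaning no execd has reported for it.
# Assumes the layout from the question: a header line, a dashed line,
# then one line per host. The "global" pseudo-host is always skipped.
hosts_without_stats() {
  awk 'NR > 2 && $1 != "global" && $2 == "-" && $4 == "-" { print $1 }'
}

# On a live system (assuming qhost is on PATH) it would be used as:
#   qhost | hosts_without_stats
```

For the qhost output in the question this prints "localhost". A host whose sge_execd is reporting normally has real values in those columns and is not printed.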

一夜七次
#3 · 2019-07-23 03:28

My problem is solved. As @Finch_Powers stated, the problem was with sge_execd: the gridengine-exec package was not installed properly. Reinstalling it fixed the issue.
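
For anyone hitting the same thing, the reinstall could look roughly like the sketch below. This is not the exact sequence I typed. The run and reinstall_execd helpers and the DRY_RUN variable are made up for illustration, and the sketch assumes the package ships a service named gridengine-exec, as it does in the Ubuntu packaging.

```shell
#!/bin/bash
# Sketch: reinstall the SGE execution daemon package on Ubuntu and check
# that sge_execd is running afterwards.
# Set DRY_RUN=1 to print the commands instead of executing them.

# Hypothetical helper: run a command, or only echo it in dry-run mode.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reinstall_execd() {
  run sudo apt-get install --reinstall gridengine-exec
  # Assumption: the service is named after the package.
  run sudo service gridengine-exec restart
  # pgrep -x matches the exact process name and fails if none is found.
  run pgrep -x sge_execd
}

# Call reinstall_execd to perform the steps.
```

Once sge_execd is up, qhost should show real values for "localhost" and the waiting job should move from "qw" to "r".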
