I'm running a script that downloads, builds and checks, every day, a program I'm contributing to. The "check" part consists of performing a suite of test runs and comparing the results with the reference.
As long as the program finishes (with or without errors), everything is OK (I mean that I can handle that). But in some cases some test run is apparently stuck in an infinite loop and I have to kill it. This is quite inconvenient for a job that's supposed to run unattended. If this happens at some point, the test will not progress any further and, worse, next day a new job will be launched, which might suffer the same problem.
Manually, I can identify the "stuck" process, for instance, with ps -u username: anything with more than, say, 15 minutes in the TIME column should be killed. Note that this is not just the "age" of the process, but the processing time used. I don't want to kill the wrapper script or the ssh session.
Before trying to write some complicated script that periodically runs ps -u username, parses the output and kills what needs to be killed, is there some easier or pre-cooked solution?
EDIT:
From the replies in the suggested thread, I have added this line to the user's crontab, which seems to work so far:
10,40 * * * * ps -eo uid,pid,time | egrep "^ *`id -u` " | egrep ' ([0-9]+-)?[0-9]{2}:[2-9][0-9]:[0-9]{2}' | awk '{print $2}' | xargs -I{} kill {}
It runs every half hour (at *:10 and *:40), identifies processes belonging to the user (id -u in backticks, because $UID is not available in dash) and with processing time longer than 20 minutes ([2-9][0-9]), and kills them.
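The time parsing is not perfect: it only looks at the minutes field, so it would not catch a process whose processing time is one or more hours plus less than 20 minutes (for example 02:05:00). Since the job runs every 30 minutes, a process should be killed long before it gets that far.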
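If the minutes-only match ever becomes a problem, one possible way around it is to convert the TIME column to seconds and compare numerically. The following is only a rough sketch, not something I have run in production: the function names cputime_to_seconds and list_cpu_hogs are made up for illustration, and it assumes ps prints TIME as [DD-]HH:MM:SS (as procps does on Linux) and accepts the POSIX -u and -o options.

```shell
# Convert a ps TIME value ([DD-]HH:MM:SS, or MM:SS) to seconds.
# Hypothetical helper; assumes the procps-style TIME format.
cputime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4)      s = (($1 * 24 + $2) * 60 + $3) * 60 + $4  # DD-HH:MM:SS
        else if (NF == 3) s = ($1 * 60 + $2) * 60 + $3              # HH:MM:SS
        else              s = $1 * 60 + $2                          # MM:SS
        print s
    }'
}

# Print the PIDs of the current user's processes whose CPU time exceeds
# the limit given in seconds (default 1200 s, i.e. the 20 minutes above).
list_cpu_hogs() {
    limit=${1:-1200}
    ps -u "$(id -u)" -o pid= -o time= | while read -r pid cputime; do
        if [ "$(cputime_to_seconds "$cputime")" -gt "$limit" ]; then
            echo "$pid"
        fi
    done
}
```

With these two functions saved in a small script, the cron job could call something like list_cpu_hogs 1200 | xargs -I{} kill {} instead of the egrep pipeline. It works in dash, since it uses no bash-specific features.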