I am writing a shell script which performs a task periodically and on receiving a USR1 signal from another process.
The structure of the script is similar to this answer:
#!/bin/bash
trap 'echo "doing some work"' SIGUSR1
while :
do
    sleep 10 && echo "doing some work" &
    wait $!
done
However, this script has the problem that the sleep process continues in the background and only dies on its timeout. (Note that when USR1 is received during wait $!, the sleep process lingers for its regular timeout, but the periodic echo does get cancelled.) You can, for example, see the number of sleep processes on your machine using pkill -0 -c sleep.
I read this page, which suggests killing the lingering sleep in the trap action, e.g.
#!/bin/bash
pid=
trap '[[ $pid ]] && kill $pid; echo "doing some work"' SIGUSR1
while :
do
    sleep 10 && echo "doing some work" &
    pid=$!
    wait $pid
    pid=
done
However, this script has a race condition if we send USR1 signals in quick succession, e.g. with:
pkill -USR1 trap-test.sh; pkill -USR1 trap-test.sh
then it will try to kill a PID which was already killed and print an error. Not to mention, I do not like this code.
Is there a better way to reliably kill the forked process when interrupted? Or an alternative structure to achieve the same functionality?
Neither of your scripts terminates sleep, and you're making it more complicated by sending USR1 using pkill. As the background job is a fork of the foreground one, they share the same name (trap-test.sh); so pkill matches and signals both. This, in an uncertain order, kills the background process (leaving sleep alive, explained below) and triggers the trap in the foreground one, hence the race condition.
Besides, in the examples you linked, the background job is always a mere sleep x, but in your script it is sleep 10 && echo 'doing some work', which requires the forked subshell to wait for sleep to terminate and then conditionally execute echo. Compare these two:
$ sleep 10 &
[1] 9401
$ pstree 9401
sleep
$
$ sleep 10 && echo foo &
[2] 9410
$ pstree 9410
bash───sleep
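The same comparison can be made from a script. This is only a rough sketch: the variable names are made up for illustration, the 5-second sleeps are arbitrary, and it assumes ps supports -o comm= and that pkill is installed.

```shell
#!/bin/bash
# Case 1: a lone command. $! is the PID of sleep itself.
sleep 5 >/dev/null 2>&1 &
plain_pid=$!

# Case 2: an AND list. $! is the PID of a forked subshell that waits for sleep.
{ sleep 5 && echo foo; } >/dev/null 2>&1 &
list_pid=$!

sleep 1   # give both children time to start
plain_comm=$(ps -o comm= -p "$plain_pid")
list_comm=$(ps -o comm= -p "$list_pid")
echo "plain job runs: $plain_comm"
echo "list job runs:  $list_comm"

# Clean up: kill the sleep under the subshell first, then the jobs themselves.
pkill -P "$list_pid"
kill "$plain_pid" "$list_pid" 2>/dev/null
wait
```

The first job should report sleep, while the second reports the shell (or the script's own name), because $! points at the subshell rather than at sleep.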
So let's start from scratch and reproduce the main issue in a terminal.
$ set +m
$ sleep 100 && echo 'doing some work' &
[1] 9923
$ pstree -pg $$
bash(9871,9871)─┬─bash(9923,9871)───sleep(9924,9871)
                └─pstree(9927,9871)
$ kill $!
$ pgrep sleep
9924
$ pkill -e sleep
sleep killed (pid 9924)
Just in case, I disabled job control to partly emulate a non-interactive shell's behavior.
Killing the background job didn't kill sleep; I needed to terminate it manually. This happened because a signal sent to a process is not automatically propagated to that process's children; i.e. sleep didn't receive the TERM signal at all.
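If you'd rather check this from a script than by eye, something along these lines should do. It is a sketch only: sub_pid, sleep_pid and sleep_survived are names I made up, and it assumes pgrep is available.

```shell
#!/bin/bash
# Start the same kind of job as above: a subshell that waits for sleep.
{ sleep 5 && echo 'doing some work'; } >/dev/null 2>&1 &
sub_pid=$!
sleep 1                              # let the subshell fork its sleep
sleep_pid=$(pgrep -P "$sub_pid")     # the sleep running under the subshell

# Signal only the subshell, exactly like "kill $!" in the session above.
kill "$sub_pid"
wait "$sub_pid" 2>/dev/null

# kill -0 sends no signal; it only tests whether the PID can still be signalled.
if kill -0 "$sleep_pid" 2>/dev/null; then
    sleep_survived=yes
else
    sleep_survived=no
fi
echo "sleep survived: $sleep_survived"

kill "$sleep_pid" 2>/dev/null        # clean up the orphaned sleep
```

The orphaned sleep is reparented and keeps running until its own timeout unless it is signalled directly.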
To kill sleep as well as the subshell, I need to put the background job into a separate process group and send the TERM signal to that group, as shown below. This requires job control to be enabled; otherwise all jobs are put into the main shell's process group, as seen in pstree's output above.
$ set -m
$ sleep 100 && echo 'doing some work' &
[1] 10058
$ pstree -pg $$
bash(9871,9871)─┬─bash(10058,10058)───sleep(10059,10058)
                └─pstree(10067,10067)
$ kill -- -$!
$
[1]+ Terminated sleep 100 && echo 'doing some work'
$ pgrep sleep
$
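The process-group part can also be checked non-interactively. The following is a sketch under a few assumptions: it runs under bash (other shells may refuse set -m without a terminal), ps supports -o pgid=, and the variable names are mine.

```shell
#!/bin/bash
set -m   # job control on: each background job gets its own process group

{ sleep 5 && echo 'doing some work'; } >/dev/null 2>&1 &
job_pid=$!
sleep 1                                  # let the subshell fork its sleep
sleep_pid=$(pgrep -P "$job_pid")

# The process group ID of both the subshell and sleep should equal $job_pid.
job_pgid=$(ps -o pgid= -p "$job_pid" | tr -d ' ')
sleep_pgid=$(ps -o pgid= -p "$sleep_pid" | tr -d ' ')
echo "job PID $job_pid, job PGID $job_pgid, sleep PGID $sleep_pgid"

# A negative PID addresses the whole group, so sleep gets TERM as well.
kill -- -"$job_pid"
wait "$job_pid" 2>/dev/null
```

Because the group ID equals the PID of the job's leader, -$! is enough to address every process in the job.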
With some refinement and adaptation of this concept, your script looks like:
#!/bin/bash -
set -m
usr1_handler() {
    kill -- -$!
    echo 'doing some work'
}
do_something() {
    trap '' USR1
    sleep 10 && echo 'doing some work'
}
trap usr1_handler USR1 EXIT
echo "my PID is $$"
while true; do
    do_something &
    wait
done
This will print my PID is xxx (where xxx is the PID of the foreground process) and start looping. Sending a USR1 signal to xxx (i.e. kill -USR1 xxx) will trigger the trap and cause the background process and its children to terminate. Thus wait will return and the loop will continue.
If you use pkill instead, it'll work anyway, as the background process ignores USR1.
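The loop relies on wait returning early when a trapped signal arrives. A small sketch of just that behavior follows; got_usr1, bg_pid and status are illustrative names, and the script signals itself after one second to stand in for the external process.

```shell
#!/bin/bash
got_usr1=0
trap 'got_usr1=1' USR1

sleep 5 >/dev/null 2>&1 &
bg_pid=$!

# Stand-in for the other process: send USR1 to this script after one second.
( sleep 1; kill -USR1 $$ ) >/dev/null 2>&1 &

# wait is interrupted by the trapped signal and returns a status above 128,
# well before the 5-second sleep would have finished.
wait "$bg_pid"
status=$?
echo "wait returned $status, trap ran: $got_usr1"

kill "$bg_pid" 2>/dev/null   # the sleep is still running; clean it up
```

Note that the interrupted wait does not touch the background job by itself, which is why the handler in the script above has to kill the process group explicitly.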
For further information, see:
- Bash Reference Manual § Special Parameters ($$ and $!),
- POSIX kill specification (-$! usage),
- POSIX Definitions § Job Control (how job control is implemented in POSIX shells),
- Bash Reference Manual § Job Control Basics (how job control works in bash),
- POSIX Shell Command Language § Signals And Error Handling,
- POSIX wait specification.
You might want to use a function that kills the whole process tree including children, tries to kill it nicely, and kills it by force if niceness isn't working.
Here's the part you can add to your script.
TrapQuit is called on SIGUSR1 or other exit signals received (including CTRL+C).
You can add whatever handling is needed in TrapQuit, or call it on a normal script exit with an exit code.
# Kill process and children bash 3.2+ implementation
# BusyBox compatible version
function IsInteger {
    local value="${1}"
    #if [[ $value =~ ^[0-9]+$ ]]; then
    expr "$value" : "^[0-9]\+$" > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo 1
    else
        echo 0
    fi
}
# Portable child (and grandchild) kill function tested under Linux, BSD, MacOS X, MSYS and cygwin
function KillChilds {
    local pid="${1}" # Parent pid to kill childs
    local self="${2:-false}" # Should parent be killed too ?
    # Paranoid checks, we can safely assume that $pid should not be 0 nor 1
    if [ $(IsInteger "$pid") -eq 0 ] || [ "$pid" == "" ] || [ "$pid" == "0" ] || [ "$pid" == "1" ]; then
        echo "CRITICAL: Bogus pid given [$pid]."
        return 1
    fi
    if kill -0 "$pid" > /dev/null 2>&1; then
        # Warning: pgrep is not native on cygwin, must be installed via procps package
        if children="$(pgrep -P "$pid")"; then
            if [[ "$pid" == *"$children"* ]]; then
                echo "CRITICAL: Bogus pgrep implementation."
                children="${children/$pid/}"
            fi
            for child in $children; do
                KillChilds "$child" true
            done
        fi
    fi
    # Try to kill nicely, if not, wait 15 seconds to let Trap actions happen before killing
    if [ "$self" == true ]; then
        # We need to check for pid again because it may have disappeared after recursive function call
        if kill -0 "$pid" > /dev/null 2>&1; then
            kill -s TERM "$pid"
            if [ $? != 0 ]; then
                sleep 15
                kill -9 "$pid"
                if [ $? != 0 ]; then
                    return 1
                fi
            else
                return 0
            fi
        else
            return 0
        fi
    else
        return 0
    fi
}
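# Assumption: the snippet never sets SCRIPT_PID; the main script's own PID is the likely intent
SCRIPT_PID=$$
function TrapQuit {
    local exitcode="${1:-0}"
    KillChilds "$SCRIPT_PID" > /dev/null 2>&1
    exit "$exitcode"
}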
# Launch TrapQuit on USR1 / other signals
trap TrapQuit USR1 QUIT INT EXIT
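To get a feel for what a recursive walk over pgrep -P visits before wiring it into a trap, a stripped-down sketch like the one below may help. The descendants helper is hypothetical and not part of the code above; it only lists PIDs and kills nothing.

```shell
#!/bin/bash
# Hypothetical helper: print every descendant PID of $1, depth first.
descendants() {
    local child
    for child in $(pgrep -P "$1"); do
        echo "$child"
        descendants "$child"
    done
}

{ sleep 5 && echo 'doing some work'; } >/dev/null 2>&1 &
job_pid=$!
sleep 1                                  # let the subshell fork its sleep

tree=$(descendants "$job_pid")
first_comm=$(ps -o comm= -p "$tree")     # only one descendant here: the sleep
echo "descendants of $job_pid: $tree ($first_comm)"

pkill -P "$job_pid"                      # clean up
wait
```

Running it on a background job like the one in the question should list the sleep that a plain kill $! would leave behind.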