Bash Command Substitution Giving Weird Inconsisten

Posted 2019-05-01 16:56

Question:

For reasons not relevant to this question, I am running a Java server from a bash script, not directly but via command substitution in a separate sub-shell, and in the background. The intent is for the subcommand to return the process id of the Java server on its standard output. The fragment in question is as follows:

launch_daemon()
{
  /bin/bash <<EOF
     $JAVA_HOME/bin/java $JAVA_OPTS -jar $JAR_FILE daemon $PWD/config/cl.yml <&- &
     pid=\$!
     echo \${pid} > $PID_FILE
     echo \${pid}   
EOF
}

daemon_pid=$(launch_daemon)

echo ${daemon_pid} > check.out

The Java daemon in question prints to standard error and quits if there is a problem in initialization, otherwise it closes standard out and standard err and continues on its way. Later in the script (not shown) I do a check to make sure the server process is running. Now on to the problem.

Whenever I check the $PID_FILE above, it contains the correct process id on one line.

But when I check the file check.out, it sometimes contains the correct id, and other times it contains the process id repeated twice on the same line, separated by a space character, as in:

34056 34056

I am using the variable $daemon_pid in the script above later on in the script to check if the server is running, so if it contains the pid repeated twice this totally throws off the test and it incorrectly thinks the server is not running. Fiddling with the script on my server box running CentOS Linux by putting in more echo statements etc. seems to flip the behavior back to the correct one of $daemon_pid containing the process id just once, but if I think that has fixed it and check in this script to my source code repo and do a build and deploy again, I start seeing the same bad behavior.

For now I have fixed this by assuming that $daemon_pid could be bad and passing it through awk as follows:

mypid=$(echo ${daemon_pid} | awk '{ gsub(" +.*",""); print $0 }')

Then $mypid always contains the correct process id and things are fine, but needless to say I'd like to understand why it behaves the way it does. And before you ask, I have looked and looked but the Java server in question does NOT print its process id to its standard out before closing standard out.
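The workaround above can also be sketched with plain parameter expansion, with no awk process. The value assigned to daemon_pid below is a hypothetical stand-in, and the sketch assumes (following the answer's explanation that the inner echo runs twice) that the captured value really holds two lines, which the unquoted echo ${daemon_pid} then joins with a space:

```shell
#!/bin/bash
# Hypothetical captured value: the inner echo ran twice, so there are two lines.
daemon_pid=$'34056\n34056'

# Unquoted expansion is word-split, so echo prints the two lines joined by a space.
joined=$(echo ${daemon_pid})

# Keep only the text before the first whitespace character (space or newline).
mypid=${daemon_pid%%[[:space:]]*}

echo "joined: ${joined}"
echo "mypid:  ${mypid}"
```

Like the awk version, this only hides the symptom; it does not stop the command from running twice.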

Would really appreciate expert input.

Answer 1:

Following the hint by @WilliamPursell, I tracked this down in the bash source code. I honestly don't know whether it is a bug or not; all I can say is that it seems like an unfortunate interaction with a questionable use case.

TL;DR: You can fix the problem by removing <&- from the script.

Closing stdin is at best questionable, not just for the reason mentioned by @JonathanLeffler ("Programs are entitled to have a standard input that's open.") but more importantly because stdin is being used by the bash process itself and closing it in the background causes a race condition.

To see what's going on, consider the following rather odd script, which might be called Duff's Bash Device, except that I'm not sure even Duff would approve. (As presented, it's not that useful. But someone somewhere has used it in some hack. Or, if not, they will now that they've seen it.)

/bin/bash <<EOF
if (($1<8)); then head -n-$1 > /dev/null; fi
echo eight
echo seven
echo six
echo five
echo four
echo three
echo two
echo one
EOF

For this to work, bash and head both have to be prepared to share stdin, including sharing the file position. That means that bash needs to make sure that it flushes its read buffer (or does not buffer at all), and head needs to make sure that it seeks back to the end of the part of the input which it uses.

(The hack only works because bash handles here-documents by copying them into a temporary file. If it used a pipe, it wouldn't be possible for head to seek backwards.)
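The shared file position can be seen in isolation with a small sketch. The temporary file and the use of fd 3 are assumptions chosen for illustration, not part of the original script; the point is only that a forked child reading from an inherited descriptor moves the offset the parent sees:

```shell
#!/bin/bash
# A seekable regular file stands in for the here-document's temporary file.
tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"

# Open the file on fd 3; child processes inherit it and share its file offset.
exec 3< "$tmp"

# The parent reads the first line.
read -r first <&3

# A forked child (the command substitution) reads the next line from the same fd.
second=$(read -r line <&3; printf '%s' "$line")

# The parent continues from wherever the child left the offset.
read -r third <&3

echo "parent: ${first}, child: ${second}, parent again: ${third}"

# Clean up the descriptor and the temporary file.
exec 3<&-
rm -f "$tmp"
```

This works because bash's read builtin, on a seekable input, seeks back to just past the line it consumed. That is the same kind of bookkeeping the hack relies on.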

Now, what would have happened if head had run in the background? The answer is, "just about anything is possible", because bash and head are racing to read from the same file descriptor. Running head in the background would be a really bad idea, even worse than the original hack which is at least predictable.

Now, let's go back to the actual program at hand, simplified to its essentials:

/bin/bash <<EOF
cmd <&- &
echo \$!
EOF

Line 2 of this program (cmd <&- &) forks off a separate process (to run in the background). In that process, it closes stdin and then invokes cmd.

Meanwhile, the foreground process continues reading commands from stdin (its stdin fd hasn't been closed, so that's fine), which causes it to execute the echo command.

Now here's the rub: bash knows that it needs to share stdin, so the background process can't just close stdin. It needs to make sure that stdin's file position is pointing to the right place, even though it may have actually read ahead a buffer's worth of input. So just before the background process closes stdin, it seeks backwards to the end of the current command line. [1]

If that seek happens before the foreground bash executes echo, then there is no problem. And if it happens after the foreground bash is done with the here-document, also no problem. But what if it happens while the echo is working? In that case, after the echo is done, bash will reread the echo command because stdin has been rewound, and the echo will be executed again.

And that's precisely what is happening in the OP. Sometimes, the background seek completes at just the wrong time, and causes echo \${pid} to be executed twice. In fact, it also causes echo \${pid} > $PID_FILE to execute twice, but that line is idempotent; had it been echo \${pid} >> $PID_FILE, the double execution would have been visible.
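One way to look for the race is to run the simplified script many times and count the runs in which echo printed more than one line. This is only a rough sketch: true stands in for the server command, the run count of 200 is arbitrary, and whether any doubled output shows up at all depends on the bash version and on timing, so a count of zero proves nothing.

```shell
#!/bin/bash
# Repeat the simplified script and count runs where the echo output was doubled.
runs=200
doubled=0

for ((i = 0; i < runs; i++)); do
  # Quoted 'EOF' so that $! is expanded by the inner bash, not by this script.
  out=$(/bin/bash <<'EOF'
true <&- &
echo $!
EOF
)
  # A clean run prints exactly one line (the background pid).
  lines=$(printf '%s\n' "$out" | wc -l)
  if (( lines > 1 )); then
    doubled=$((doubled + 1))
  fi
done

echo "doubled output in ${doubled} of ${runs} runs"
```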

So the solution is simple: remove <&- from the server start-up line, and optionally replace it with </dev/null if you want to make sure the server can't read from stdin.
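A minimal sketch of the corrected launcher follows. Here sleep 30 is a stand-in for the Java server and PID_FILE is a hypothetical temporary path. Unlike the real daemon, sleep does not close its own stdout, so the sketch redirects stdout and stderr explicitly; otherwise the command substitution would wait for the background process to exit.

```shell
#!/bin/bash
# Hypothetical pid file location for this sketch.
PID_FILE=$(mktemp)

launch_daemon()
{
  /bin/bash <<EOF
     # stdin comes from /dev/null instead of being closed with <&-
     sleep 30 </dev/null >/dev/null 2>&1 &
     pid=\$!
     echo \${pid} > $PID_FILE
     echo \${pid}
EOF
}

daemon_pid=$(launch_daemon)
pid_from_file=$(cat "$PID_FILE")

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "${daemon_pid}" 2>/dev/null; then
  echo "stand-in daemon running with pid ${daemon_pid}"
fi

# Clean up the stand-in process and the temporary pid file.
kill "${daemon_pid}" 2>/dev/null
rm -f "$PID_FILE"
```

With no redirection that closes stdin, the background process has no reason to seek the shared descriptor, so the foreground bash should read each line of the here-document only once.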


Notes:

Note 1: For those more familiar with the bash source code and its expected behaviour than I am, I believe that the seek and close take place at the end of case r_close_this: in function do_redirection_internal in redir.c, at approximately line 1093:

check_bash_input (redirector);
close_buffered_fd (redirector);

The first call does the lseek and the second one does the close. I saw the behaviour using strace -f and then searched the code for a plausible looking lseek, but I didn't go to the trouble of verifying in a debugger.



Tags: linux bash shell